Joint use of neural network technologies and decision trees for logical patterns exploration in data

V.N. Gridin1, V.I. Solodovnikov1

1 Design information technologies Center RAS, Str. Marshal Biryuzov 7a, 143000, Odintsovo, Moscow Region, Russia

Abstract

The paper considers the joint use of neural network technologies with methods of logical inference and decision-making support in data mining tasks. Algorithms for searching for logical patterns in data are analyzed, together with their advantages and disadvantages. Combined algorithms for extracting rules from trained neural networks and presenting the result as a hierarchical, sequential structure of "if-then" rules are described. The representation of decision trees in the form of semantic network facts is also considered.

Keywords: neural network; decision trees; logical inference; rule extraction; data mining

1. Introduction

Data is a valuable resource that holds great potential for the extraction of useful analytical information. The tasks of revealing hidden regularities, developing decision-making strategies and forecasting are therefore becoming increasingly important, and they call for a closer look at the exploration of logical patterns in classification problems. A peculiarity of the algorithms and methods applicable to data mining problems is the absence of a priori assumptions about the structure of the sample and the type of distribution of the analyzed indicator values. One of the approaches that comes closest to satisfying this condition is based on neural network technologies. This is due to the ability of neural networks to model nonlinear processes, handle extremely complex dependencies, adapt to operating conditions and, most importantly, extract and generalize essential features from incoming information. The network thus constructs rules, but these rules are contained in the weighting coefficients, activation functions and neuron connections, and their structure is usually too complex to perceive or to use for determining the effect of a particular attribute on the output value. The neural network in effect acts as a "black box": initial data are supplied to its input and a certain output result is obtained, but no rationale is provided for why this decision was made. To solve this problem, the joint use of neural network technologies with logical inference methods, in particular decision trees, is proposed as a means of decision-making support, logical pattern exploration and presentation of the result in the form of a hierarchical structure of classifying rules. The use of semantic networks provides additional opportunities for constructing inference mechanisms and presenting the decision-making process.

2. Searching for logical patterns in the data

By a logical regularity we will understand an easily interpreted rule that selects many objects of one class from the training sample and selects practically no objects of other classes. Logical patterns are the elementary "building blocks" for a wide class of classification algorithms.
Rules that express regularities are formulated in the language of first-order predicate logic and have the following form:

IF (condition_1) AND (condition_2) AND ... AND (condition_N) THEN (conclusion),

where condition_i can be x_i = c1, x_i < c2, x_i > c3, c4 ≤ x_i ≤ c5, etc.; x_i is a variable and c1, c2, c3, c4, c5 are constants. For nominative data, the predicates «=» and «<>» are used.

2.1. The limited search algorithms

Limited search algorithms are used to search for logical regularities in data and to solve classification and forecasting problems [1]. The main idea of the method is to analyze the frequency of occurrence of various combinations of simple logical events. At the initial stages, short associative chains are sought; as the system operates, these are elaborated by adding new elements to them. On the basis of this analysis, the system draws a conclusion about the usefulness of a particular combination and thus establishes the logical patterns in the data. The main disadvantage is that such algorithms can find a solution in acceptable time only for data of small dimension.

2.2. Decision trees

Decision trees belong to the methods of searching for logical regularities in data and are the main approach used in decision-making theory. They represent a hierarchical structure of "if-then" classifying rules that has the form of a tree. Their main advantage is the simplicity and clarity with which they describe the decision-making process. Their disadvantage in the problem of searching for logical patterns is that they cannot find the most complete and accurate rules in the data: they only implement the simplest principle of sequentially examining attributes and so form fragments of regularities. Also, for large volumes of multidimensional data these algorithms can produce a very complex tree structure with many nodes and branches. Such trees can be very difficult to analyze and understand, and accordingly the rules and patterns discovered by such a tree can be difficult to comprehend. In addition, a branchy tree with many nodes divides the training set into a large number of subsets consisting of small numbers of objects, whereas it is much more preferable to have a tree with a small number of nodes, each of which corresponds to a large number of objects from the training sample. To address this problem, branch pruning algorithms are often used [2], but they do not always lead to the desired result. Nevertheless, methods of searching for regularities with the help of decision trees make it possible to find connections contained not only in individual features but also in combinations of features, which in many cases gives these methods a significant advantage over classical methods of multivariate analysis. Figure 1 shows an example of such a decision tree and the corresponding logical deduction, where φ1, φ2, φ3 are predicates, x, y, z are variables, and α, β, γ are constants:

φ1(x, α) → (y = A),
φ2(x, β) ∧ φ3(z, γ) → (y = B).

Fig. 1. An example of a decision tree.

Rules that express regularities are formulated as expressions of the form «IF A THEN B» or, in the case of a set of conditions, «IF (condition_1) ∧ (condition_2) ∧ ... ∧ (condition_N) THEN (the output node value)».
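As a simple illustration of how such a tree maps to "if-then" rules, the following Python sketch (ours, not part of the original algorithms; the tree encoding and names are hypothetical and mirror Fig. 1) enumerates the root-to-leaf paths and prints one rule per leaf:

def extract_rules(node, conditions=()):
    """Recursively collect one 'IF ... THEN ...' rule per leaf of the tree."""
    if "class" in node:  # leaf: emit the accumulated conditions
        premise = " AND ".join(conditions) or "TRUE"
        return [f"IF {premise} THEN y = {node['class']}"]
    rules = []
    for test, child in node["branches"]:  # internal node: one branch per test outcome
        rules += extract_rules(child, conditions + (f"{node['attr']} {test}",))
    return rules

# Hypothetical tree mirroring Fig. 1: the root tests x, a deeper node tests z.
tree = {"attr": "x", "branches": [
    ("= alpha", {"class": "A"}),
    ("= beta", {"attr": "z", "branches": [("= gamma", {"class": "B"})]}),
]}

for rule in extract_rules(tree):
    print(rule)
# Output:
# IF x = alpha THEN y = A
# IF x = beta AND z = gamma THEN y = B

Each rule corresponds to exactly one path from the root to a leaf, which is why the number of extracted rules grows with the number of leaves of the tree.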
Decision trees are usually constructed in one of the following ways:
- on the basis of expert assessments;
- using sample processing algorithms (CLS, ID3 (Iterative Dichotomizer), C4.5, CART (Classification and Regression Trees), etc.);
- using genetic algorithms and evolutionary programming.
Each of these approaches has its advantages and disadvantages and is suited to its own specific tasks.

2.3. Genetic algorithms

The most difficult problem in the search for logical regularities in data sets is finding the elementary events that make up the terms of the conditional "IF" part. At present, genetic algorithms (GA) are increasingly used to solve this problem, including Bucket-Brigade, REGAL, G-NET, HIDER, SIAO1 and some others. However, they are not without drawbacks: the set of rules and their length are fixed, and most of them do not take accuracy and completeness into account.

2.4. Neural network methods

The approach based on neural network data processing technologies is motivated by the ability of neural networks to model non-linear processes, handle extremely complex dependencies, adapt to operating conditions, and work with noisy data and a lack of a priori information. Most importantly, they are able to learn from experience, generalize previous precedents to new cases and extract significant features from incoming information. The network thus constructs rules, but these rules are contained in the weighting coefficients, activation functions and neuron connections, and their structure is usually too complex to perceive. Moreover, in a multilayer network these parameters can represent non-linear, non-monotonic relationships between the input and target values. As a rule, it is therefore impossible to isolate the effect of a particular attribute on the target value, because this effect can be mediated by the values of other parameters. The neural network in effect acts as a "black box": initial data are supplied to its input and a certain output result is obtained, but no rationale is provided for why this decision was made.

3. Getting logical patterns from a trained neural network

Let the problem consist in classifying a certain set of data with the help of a perceptron and then analyzing the obtained network in order to find the classifying rules that characterize each of the classes. First, let us consider this problem using the example of a single-layer perceptron with five Boolean inputs and one output neuron. Such a network can be interpreted exactly by a finite number of "if-then" rules, since only a finite number of possible input vectors are defined for it. The neuron computes the weighted sum S = w1x1 + w2x2 + ... + w5x5 and outputs y = F(S) = 1 if S ≥ θ, and 0 otherwise.

Fig. 2. Single-layer perceptron with five Boolean inputs and one output.

Let the Boolean inputs be encoded bipolarly (x_i = +1 for "true", x_i = -1 for "false"; with this encoding the weights, threshold and rules below are mutually consistent), and let the weights take the values w1 = 6, w2 = 4, w3 = 4, w4 = 0, w5 = 4, with the threshold θ = 9. In this case, the following set of rules can be extracted from the network:

x1 ∧ x2 ∧ x3 → y
x1 ∧ x2 ∧ x5 → y
x1 ∧ x3 ∧ x5 → y

The decision procedure thus predicts y = true if the activation of the output neuron is 1, and y = false if the activation is 0.
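The correspondence between the weights and the extracted rules can be checked mechanically. The following Python sketch (our illustration, assuming the bipolar ±1 encoding stated above) enumerates all 2^5 = 32 input vectors and confirms that the three rules reproduce the network's output exactly:

from itertools import product

W = [6, 4, 4, 0, 4]  # w1..w5 from the example
THETA = 9            # threshold

def fires(x):
    """Output neuron activation: 1 if the weighted sum reaches the threshold."""
    return sum(w * xi for w, xi in zip(W, x)) >= THETA

def rules_say_true(x):
    """The three extracted rules; +1 is interpreted as 'true'."""
    x1, x2, x3, x4, x5 = (xi > 0 for xi in x)
    return (x1 and x2 and x3) or (x1 and x2 and x5) or (x1 and x3 and x5)

# The network and the rule set agree on all 32 possible input vectors.
assert all(fires(x) == rules_say_true(x) for x in product((-1, 1), repeat=5))
print("the rules reproduce the network on all 32 inputs")

Note that x4 never appears in the rules because w4 = 0: its value cannot affect the comparison with the threshold.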
Generally speaking, two approaches to extracting rules from multilayer neural networks can be distinguished [3]. The first approach is to extract a set of global rules that characterize the output classes directly in terms of the values of the input parameters. An alternative is to extract local rules by decomposing the multilayer network into a collection of single-layer networks. Each extracted local rule characterizes a separate hidden or output neuron, taking into account the elements that have weighted connections to it. All the extracted rules are then combined into a set that determines the behavior of the network as a whole. The local approach is illustrated in Figure 3.

Fig. 3. The local approach to rule extraction: the multilayer neural network is decomposed into a set of single-layer ones; rules describing each component are extracted and then combined into a set that characterizes the multilayer network.

Let us consider the rule extraction problem in a more general form. Let X denote a set of n properties X1, X2, ..., Xn, and let {xi} be the set of possible values that a property Xi can take. Let C denote the set of classes c1, c2, ..., cm. For the training sample, the associated pairs of input and output vector values (x1, ..., xn, cj) are known, where cj ∈ C.

3.1. Local approach to rule extraction

NeuroRule is one of the algorithms for extracting rules from neural networks trained to solve a classification problem [4]. The algorithm includes three main steps.

Step 1. Neural network training. At the first stage, a two-layer perceptron is trained until sufficient classification accuracy is obtained. Initially, a large number of hidden-layer neurons is chosen; unnecessary neurons and connections are discarded after training.

Step 2. Thinning of the neural network. The trained neural network contains all possible connections between the input neurons and the hidden-layer neurons, as well as between the hidden and output neurons. The total number of these connections is usually so large that it is impossible to extract user-observable classifying rules from an analysis of their values. Thinning consists of removing unnecessary connections and neurons whose absence does not increase the classification error of the network. The resulting network usually contains far fewer neurons and connections between them, and the operation of such a network can be investigated.

Step 3. Rule extraction. At this stage, rules of the form «IF (x1 Θ q1) AND (x2 Θ q2) AND ... AND (xn Θ qn) THEN cj» are extracted from the thinned neural network. Here q1, ..., qn are constants and Θ is a relational operator (=, ≤, ≥, <>). First, preparation for rule extraction takes place, which includes coding all continuous quantities among the input and interior network values. Coding is also performed for the features of the classified objects if they have continuous values; to represent them, binary neurons and a coding principle such as a thermometer code can be used. The resulting values of the hidden-layer neurons are clustered and replaced with the values of the cluster centers; it is important to select a small number of such clusters. After this discretization of the hidden neurons' behavior, the classification accuracy of the network is checked. If it remains acceptable, the preparation for rule extraction is complete.
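A minimal sketch of this preparation step is given below (our illustration, assuming a simple one-dimensional k-means with a hand-picked number of clusters; the actual clustering procedure of NeuroRule is described in [4]). The activations of each hidden neuron over the training sample are grouped, and every activation is replaced by its cluster center, so the neuron effectively takes only a few discrete values and can participate in "if-then" rules:

import numpy as np

def discretize_activations(acts, k=3, iters=50):
    """Cluster one hidden neuron's activations (1-D k-means) and return
    the same activations quantized to their cluster centers."""
    centers = np.linspace(acts.min(), acts.max(), k)  # initial centers
    for _ in range(iters):
        labels = np.argmin(np.abs(acts[:, None] - centers[None, :]), axis=1)
        for j in range(k):  # move each center to the mean of its cluster
            if np.any(labels == j):
                centers[j] = acts[labels == j].mean()
    return centers[labels]

# Usage: activations of a single hidden neuron over the training sample.
acts = np.array([0.11, 0.13, 0.52, 0.49, 0.90, 0.88, 0.12, 0.51])
print(discretize_activations(acts, k=3))
# After quantization the neuron outputs one of k values; the classification
# accuracy of the quantized network is then re-checked, as described above.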
The rule extraction itself then takes place, during which the network is traversed from the classifying output neurons to the network inputs. The resulting rules are assumed to be fairly transparent under verification and readily applicable to large databases. However, this algorithm imposes rather strict restrictions on the architecture of the neural network, the number of elements and connections, and the type of activation functions: the hidden neurons use the hyperbolic tangent, with states varying in the interval [-1, 1], while the output neurons use the Fermi (logistic) function, with states in the interval [0, 1].

3.2. Global rules extraction

The lack of universality and scalability can be mentioned as the main drawbacks of most rule extraction algorithms. In this regard, the TREPAN algorithm [5] is of greatest interest: it is free of these shortcomings and imposes no requirements on the network architecture, the input and output values, the learning algorithm, etc. This approach builds a decision tree on the basis of the knowledge embedded in the trained neural network; it is enough that the network is a kind of "black box", "expert" or "oracle" to which questions can be posed and from which answers can be obtained. Moreover, the algorithm is sufficiently universal to be applied to a wide range of other trained classifiers. It also scales well and is not sensitive to the dimension of the input attributes or the size of the network. The algorithm builds a decision tree that approximates the functionality of the trained neural network and consists of the following two stages.

Preliminary stage:
1. Construct and train a neural network that will later act as the "Expert" or "Oracle".
2. Initialize the root of the tree R as a leaf.
3. Use the entire training set of examples S to construct a distribution model M_R of the input vectors that reach the node R. Compute q = max(0, minSamples - |S|), where minSamples is the minimum number of training examples used in each node of the tree and |S| is the size of the current training sample. Thus, q is the number of additional examples that need to be generated.
4. Randomly generate q new training examples based on the estimated attribute distributions of S; query_R is the set of q examples generated by the model M_R.
5. Use the neural network as an "Oracle" to assign both the new examples query_R and the old examples from S to particular classes: for each attribute vector x ∈ (S ∪ query_R), set the class label to Oracle(x).
6. Initialize the Queue by placing in it the tuple ⟨R, S, query_R, {empty_constr}⟩.

Main stage:
7. Take the next tuple ⟨N, S_N, query_N, constr_N⟩ from the head of the Queue, where N is a tree node, S_N is the training sample at node N, and constr_N is the set of restrictions on the attribute values that training examples must satisfy in order to reach N.
8. Use F, S_N and query_N to construct the branching T at node N. Here F is the function for evaluating a node N; it has the form F(N) = R(N) × (1 - f(N)), where R(N) is the probability that an example reaches node N and f(N) is an estimate of how correctly the tree processes these examples. Thus, the best node is chosen: the one whose branching has the greatest impact on the classification accuracy of the generated tree. The separation of the examples that reach this internal node of the tree is carried out using an m-of-n test [5, 6].
Such a test is considered passed when at least m of its n conditions are satisfied. Alternatively, the set S can be split as in the usual decision tree construction algorithm.
9. Create next-generation nodes for each branch t of the branching T:
a. Create C as a new child node of N.
b. Add the restriction of branch t: constr_C = constr_N ∪ {T = t}.
c. Form the set S_C containing the examples from S_N that satisfy the condition on branch t.
d. Construct a model M_C of the distribution of the examples that reach node C. Calculate the number of examples to generate: q = max(0, minSamples - |S_C|).
e. Randomly generate q new training examples based on the estimated attribute distributions of S_C and the constraint values constr_C; query_C is the set of q examples generated by the model M_C under the constraints constr_C.
f. Use the neural network as an "Oracle" to classify the new examples x ∈ query_C and assign them the class labels Oracle(x).
g. Initially assume that node C is a leaf; use S_C and query_C to determine the class label for C.
h. Check whether node C requires further branching. Put the tuple ⟨C, S_C, query_C, constr_C⟩ into the Queue if the local stopping criterion is not satisfied. The local criterion here is the probability that the node contains instances of a single class.
10. If the Queue is not empty and the global stopping criterion is not fulfilled, go to step 7; otherwise return the tree with root R.

The maximum tree size and an overall evaluation of the quality with which the tree classifies examples serve as the global stopping criteria of the algorithm. The main advantage of this approach is the generalization ability of artificial neural networks, which makes it possible to obtain simpler decision trees. In addition, the use of such an "Oracle" compensates for the lack of data that is usually observed at the lower levels when decision trees are constructed by sample processing algorithms. It thus becomes possible to extract structured knowledge not only from extremely simplified neural networks but also from arbitrary classifiers, which makes the algorithm applicable to a wide range of practical problems.
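The following Python sketch (a deliberately simplified illustration, not the full algorithm of [5]: the distribution model M_N is replaced by independent per-attribute sampling, and the oracle is an arbitrary stand-in classifier) shows three ingredients of the scheme above: oracle labeling of generated examples, the m-of-n test, and the node evaluation function F(N) = R(N) × (1 - f(N)):

import numpy as np

def m_of_n(x, conditions, m):
    """m-of-n test: passes when at least m of the n conditions hold."""
    return sum(cond(x) for cond in conditions) >= m

def draw_examples(S, q, rng):
    """Crude stand-in for the model M_N: sample each attribute independently
    from its empirical marginal distribution in S."""
    cols = [rng.choice(S[:, j], size=q) for j in range(S.shape[1])]
    return np.column_stack(cols)

def node_priority(reach_prob, fidelity):
    """F(N) = R(N) * (1 - f(N)): nodes with higher values are expanded first."""
    return reach_prob * (1.0 - fidelity)

rng = np.random.default_rng(0)
S = rng.normal(size=(20, 3))                         # toy training inputs
q = max(0, 50 - len(S))                              # minSamples = 50
query = draw_examples(S, q, rng)                     # extra examples for this node
oracle = lambda x: int(x[0] + x[1] > 0)              # stand-in for the trained network
labels = [oracle(x) for x in np.vstack([S, query])]  # the Oracle labels everything
conds = [lambda x: x[0] > 0, lambda x: x[1] > 0, lambda x: x[2] > 0]
passed = [x for x in np.vstack([S, query]) if m_of_n(x, conds, m=2)]
print(len(passed), "of", len(S) + q, "examples pass the 2-of-3 test")
print("priority of a rarely reached, well-fitted node:", node_priority(0.1, 0.95))

In the real algorithm the generated examples must also satisfy the constraints constr_N accumulated along the path to the node, which the marginal sampling above deliberately omits for brevity.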
4. Rules representation in the form of a semantic network

A simple semantic network, sometimes called a computational semantic network, is in fact a bipartite graph. Let there be a finite set A = {A1, ..., Ar} of attributes and a finite set R = {R1, ..., Rn} of relations. The scheme, or intensional, of the relation Ri (i = 1, ..., n) is the set of pairs INT(Ri) = {..., [Aj, DOM(Aj)], ...}, where Ri is the name of the relation and DOM(Aj) is the domain of Aj (j = 1, ..., r), i.e. the set of values of attribute Aj in the relation Ri. The union of all domains is called the base set of the model, or the set of objects on which the relations R are specified. The extensional of a relation Ri is the set EXT(Ri) = {F1, ..., Fp}, where Fk (k = 1, ..., p) is a fact of the relation Ri. A fact is specified by an aggregate of attribute-value pairs, called attribute pairs, and is a concretization of a certain relationship between the specified objects. In the graphical interpretation, a fact is a subgraph of the semantic network with a star-shaped structure. The root of such a subgraph is a vertex of predicate type, labeled with a unique label that includes the name of the corresponding relation. Edges marked with the names of the attributes of the fact emanate from the fact vertex; they lead to the vertices of the base set that are the values of these attributes.

It is worth noting that a semantic network can be used as a storage of facts derived from processing decision trees, i.e. the decision tree is transformed into a semantic network. In this case, each fact is presented in the form of a ready-made inference output, which provides additional opportunities for analysis. For example, even ordinary database facilities can find facts that belong to different relations but share the same characterizing attributes and values. To store semantic networks, or rather their extensionals, in a database, a table of the following form can be used:

Table 1. Table of the extensional.
Field name | Data type | Field properties
CodeValue | Numeric | Key field
FactMark | Numeric | The fact mark
FactAttributes | Text | Attribute of the fact
AttributeValue | Text | Attribute value

The decision tree view is thus obtained in the form of a fact table, in which a record with the values of the fact's attributes exists for each value of the output inference variable. Since each representation maps onto the other, from the formal point of view the two representations are identical. Moreover, conclusions can be drawn in a semantic network that are far from obvious for a decision tree. Consider a simple example: a decision tree determines the conclusion establishing parental relations in the first generation, and it is necessary to define the parent relationship in the second generation. The solution of this problem for the semantic network is obvious: highlight the facts of parental relations, then delete from the intensional the objects that do not match the parents, and repeat the process of highlighting the facts of the parental relation. This means that the semantic network allows introducing means of constructing functional dependencies, similar to how this is done in functional programming languages such as LISP [7]. A fact can be regarded as a list of atoms, each of which is assigned a value either directly or in the process of inference. If operations for manipulating such lists are introduced, decision trees can be created and modified by formal methods. Thus, it seems advisable to combine in a single system the representation in the form of decision trees and the class of semantic networks, which makes it possible to display the decision-making process visually and gives additional possibilities in constructing inference mechanisms.
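The mapping from a rule set to the extensional table can be illustrated as follows (our sketch; the rules are hypothetical and follow the example of Fig. 1, with one fact per rule and one record per attribute-value pair, using the schema of Table 1):

# Illustrative conversion of decision-tree rules into fact-table records.
rules = [  # (premise as attribute-value pairs, inference output)
    ({"x": "alpha"}, ("y", "A")),
    ({"x": "beta", "z": "gamma"}, ("y", "B")),
]

table, code = [], 0
for fact_mark, (premise, (out_attr, out_val)) in enumerate(rules, start=1):
    for attr, val in list(premise.items()) + [(out_attr, out_val)]:
        code += 1
        table.append({"CodeValue": code,        # key field
                      "FactMark": fact_mark,    # which fact (rule) the row belongs to
                      "FactAttributes": attr,   # attribute of the fact
                      "AttributeValue": val})   # its value

for row in table:
    print(row)

With this layout, a query that matches on FactAttributes and AttributeValue finds facts from different relations that share attribute values, as noted above.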
5. Conclusion

In this paper, the problems of searching for logical regularities in classification tasks were considered. The joint use of neural network technologies with logical inference methods, in particular decision trees, is proposed as a means of exploring logical patterns and presenting the result in the form of a hierarchical structure of classifying rules. Two main approaches were identified. The first is to extract local rules, with the multilayer network decomposed into a set of single-layer ones. Each local rule characterizes a separate hidden or output neuron, taking into account the elements that have weighted connections to it; the rules are then combined into a set that determines the behavior of the network as a whole. However, this approach often imposes fairly strict restrictions on the network architecture, the number of elements and connections, and the type of activation functions, which negatively affects universality and scalability. The alternative is to extract a set of global rules that characterize the output classes directly in terms of the values of the input parameters. Within this approach, a modified algorithm for constructing decision trees on the basis of trained neural networks has been developed. It imposes no requirements on the architecture, learning algorithm, input and output values or other network parameters. The construction of the tree is based on the knowledge embedded in the trained neural network, and it is enough that the network is a kind of "black box" or "expert" to which questions can be posed and from which answers can be obtained. The main advantages of this approach are the generalization ability of artificial neural networks, which makes it possible to obtain simpler decision trees and, if necessary, to compensate for a lack of initial data. Moreover, the algorithm is sufficiently universal, scales well and is not sensitive to the dimension of the input attributes or the size of the network. This circumstance acquires special significance in light of the rapid development of deep learning technology. It thus becomes possible to extract structured knowledge not only from extremely simplified neural networks but also from arbitrary classifiers, which makes the algorithm applicable to a wide range of practical problems. In addition, an algorithm has been developed for converting already formed decision trees into semantic networks in the form of a bipartite graph. This provides a view of the decision tree in the form of a fact table and allows a quick search by known attributes. The introduction of such automation tools into data mining systems can shorten the time required for decision making and improve its quality and effectiveness.

Acknowledgements

This work was carried out with the financial support of the RFBR, project 15-07-01117a.

References

[1] Dyuk V, Samojlenko A. Data Mining. SPb: Piter, 2001; 368 p.
[2] BaseGroup Labs company website. Decision trees: general principles of operation. URL: https://basegroup.ru/community/articles/description (10.01.2017).
[3] Gridin VN, Solodovnikov VI, Evdokimov IA, Filippkov SV. Building decision trees and extracting rules from trained neural networks. Iskusstvennyj intellekt i prinyatie reshenij 2013; 4: 26–33.
[4] Ezhov AA, Shumskij SA. Neurocomputing and its application in economics and business. M.: MIFI, 1998; 224 p.
[5] Craven MW, Shavlik JW. Extracting tree-structured representations of trained networks. Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 1996; 8: 24–30.
[6] Murphy PM, Pazzani MJ. ID2-of-3: Constructive induction of M-of-N concepts for discriminators in decision trees. Proceedings of the Eighth International Machine Learning Workshop, Evanston, IL, 1991; 183–187.
[7] Hyuvenen EH, Seppyanen I. The World of LISP: Methods and Systems of Programming. M.: Mir, 1990; 320 p.