A Random Forest Model Building Using A Priori Information for Diagnosis

Sergey Subbotin [0000-0001-5814-8268]

Zaporizhzhia National Technical University, Zhukovsky str., 64, Zaporizhzhia, 69063, Ukraine
subbotin@zntu.edu.ua

Abstract. The problem of inductive model building on precedents for biomedical applications is considered. The model paradigm is a random forest, i.e. a set of decision tree classifiers working as an ensemble. The a priori information extracted from the training data set is used in the proposed method of random forest model building to obtain a more accurate model while preserving the generally random character of the method. The resulting random forest is more accurate than a single decision tree, and in comparison with the known methods of random forest model building the proposed method is also more accurate.

Keywords: medical diagnosis, decision tree, random forest, training, inductive learning

1 Introduction

Decision trees [1-2], which are hierarchical tree-like models for obtaining a decision on assigning a recognized instance to one of the possible classes, are useful for problems of biomedical diagnosis by precedents. However, the known methods of constructing models based on decision trees [1-24] are not always able to provide the required level of classification accuracy. The random forest paradigm [25-31] is used to improve the accuracy of models based on decision trees.

A random forest [25-31] is a committee (ensemble) of decision trees and combines two main ideas: the bagging method of L. Breiman [32, 33] and the random subspace method of T. Ho [26]. The general idea of the method is to use a large ensemble of decision trees, each of which by itself gives rather low classification quality, but which together provide a good result due to the large number of predictors [25]. Decision trees are a good family of base classifiers for bagging [32], since they are quite complex and can reach zero error on any sample. The random subspace method [26] reduces the correlation between the trees and helps avoid overfitting: the base classifiers are trained on different, randomly selected subsets of the features.

The random forest paradigm has the following advantages [25-31]: the ability to efficiently process data with a large number of features and classes; insensitivity to scaling (and, in general, to any monotonic transformations) of feature values; resistance to the inclusion of irrelevant (useless) features; the ability to process both continuous and discrete features; the ability to work with missing data; the possibility of being used for feature significance estimation; insensitivity to outliers in the data due to random sampling; the possibility of evaluating the generalization ability of the resulting model (testing on out-of-bag instances, i.e. instances not selected into the bootstrap sample); high parallelizability and scalability; high prediction accuracy; the ability to balance the weight of each class on the entire sample or on the subsample of each tree; a low tendency to overfitting (in practice adding trees almost always improves the composition, but upon validation, after reaching a certain number of trees, the learning curve approaches an asymptote); by modifying their definition, random forests can be represented as kernel methods, which are more convenient and interpretable for analysis; a random forest can be converted into a k-nearest neighbors model.
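To make the combination of bagging and random subspaces described above concrete, the sketch below assembles a forest from decision trees, assuming scikit-learn's DecisionTreeClassifier as the base learner. The function names, the majority-vote combiner and the default m = sqrt(N) are illustrative assumptions only, not the method proposed in this paper.

# A minimal sketch of the combined paradigm: bagging (bootstrap samples of instances)
# plus random subspaces (a random feature subset per tree). All names and defaults
# here are illustrative, not the author's implementation.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(x, y, T=100, m=None, random_state=0):
    """Train T trees, each on a bootstrap sample of the S instances (bagging, Breiman)
    restricted to m randomly chosen features (random subspaces, Ho)."""
    rng = np.random.default_rng(random_state)
    S, N = x.shape
    if m is None:
        m = max(1, int(np.sqrt(N)))                  # a common default for classification
    forest = []
    for _ in range(T):
        rows = rng.integers(0, S, size=S)            # sample S instances with replacement
        cols = rng.choice(N, size=m, replace=False)  # random feature subspace for this tree
        tree = DecisionTreeClassifier().fit(x[np.ix_(rows, cols)], y[rows])
        forest.append((tree, cols))
    return forest

def predict_forest(forest, x):
    """Combine the trees' outputs by majority vote (assumes integer class labels >= 0)."""
    votes = np.stack([tree.predict(x[:, cols]) for tree, cols in forest])
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)

Note that scikit-learn's own RandomForestClassifier draws a random feature subset at every split rather than once per tree; the per-tree sampling above follows the random subspace formulation of Ho [26].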
However, a random forest has such disadvantages as [25-31]: not very high accuracy of models; lower interpretability of the model compared to a single tree; the lack of formal conclusions (p-values) available for assessing feature importance; poor quality of work for samples containing a large number of sparse features; a tendency to overfit on noisy data; the inability to extrapolate; model bias for data including categorical variables with differing numbers of levels, in favour of features with a large number of levels (when a feature has many levels, the tree adapts more strongly to it, since it can obtain a higher value of the optimized functional, such as information gain); a tendency to prefer small groups of correlated features that have similar significance for the labels over large groups; the large size of the resulting models (the spatial complexity of storing the model is estimated as O(ST), where S is the volume of the initial sample and T is the number of trees); the large time spent on building the model compared to a single decision tree.

The aim of this paper is to improve the random forest building method, preserving its random character but concentrating the method on those solutions which appear more likely to increase model accuracy, using the a priori information extracted from the training sample.

2 Formal problem statement

Let us have a training set of S precedents (observations, instances, cases) <x, y>, where x = {x^s}, y = {y^s}, s = 1, 2, ..., S, x^s is the s-th instance of the sample, y^s is the output feature value associated with the s-th instance, x^s = {x_j^s}, j = 1, 2, ..., N, x_j^s is the value of the j-th input feature of the s-th instance, and N is the number of input features.

Then the general task of constructing a model of the dependence y = f(w, x) is to find such a model structure f and such values of the model parameters w for which the model quality criterion F is satisfied. As a rule, for approximation problems the model quality criterion is determined as a function of the model error (1):

E = \sum_{s=1}^{S} \left( y^s - f(w, x^s) \right)^2 \to 0.   (1)

The decision tree model structure consists of nodes and links (connections between nodes). The decision tree model parameters are the numbers of the features used in the tree nodes, as well as their boundary values for splitting the ranges of feature values. A random forest is a collection of trees as structures with parameters, together with a transformation that combines the results of the trees' work and the weights of the trees in making the final decision.

3 Literature review

The methods of decision tree construction [1-24] hierarchically divide the initial space into regions, in which they evaluate the average output value of the instances falling into the region (cluster). This value is assigned to the output feature of all instances falling into this cluster. The advantage of this group of methods is the simplicity and interpretability of the resulting models, as well as the possibility of solving cluster analysis tasks along the way and selecting informative features. The disadvantage of this group of methods is the low accuracy of the obtained models.

The random forest method aims to improve classification accuracy by joining a set of separately generated random models. This makes it possible to approximate the interclass boundary more accurately than a single decision tree does.
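This accuracy gain of an ensemble over a single tree is easy to observe empirically. The sketch below only illustrates that general claim; the dataset (scikit-learn's built-in breast cancer sample, used here as a stand-in biomedical task) and the 5-fold cross-validation protocol are assumptions made for the demo and are unrelated to the experiments of this paper.

# Illustrative comparison of a single decision tree against a random forest on a
# biomedical-style classification task; dataset and protocol are demo assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

x, y = load_breast_cancer(return_X_y=True)

tree_acc = cross_val_score(DecisionTreeClassifier(random_state=0), x, y, cv=5).mean()
forest_acc = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), x, y, cv=5).mean()

print(f"single decision tree, mean CV accuracy: {tree_acc:.3f}")
print(f"random forest,        mean CV accuracy: {forest_acc:.3f}")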
Let the training sample consist of S instances, let the dimension of the feature space be N, and let the parameter m (the number of selected features) be specified. Set the number of individual models (trees) in the ensemble (forest) T, as well as the maximum acceptable number of instances in a tree node or the maximum allowable tree height as a stop criterion Z. The most common way to build the committee's trees is called bagging (short for bootstrap aggregation). To build the next t-th model in the form of a decision tree (t = 1, 2, ..., T):
– set the number of features for the t-th model m_t.

2. Estimate the a priori information:
– determine the individual estimates of the informativeness of each j-th input feature [34-46] with respect to the output feature, I_j, j = 1, 2, ..., N;
– determine the individual estimates of the informativeness of each i-th input feature [34-46] with respect to the j-th input feature, I_{i,j}, i, j = 1, 2, ..., N;
– estimate the individual informativeness I^s of each sample instance [35, 44], s = 1, 2, ..., S;
– determine the measure of the distance between the s-th and p-th instances of the sample in the feature space, d^{s,p}. For small samples evaluate it by (4):

d^{s,p} = d^{p,s} = \sum_{j=1}^{N} (x_j^s - x_j^p)^2,  s = 1, 2, ..., S,  p = 1, 2, ..., s.   (4)

For large samples, instead of the distance between instances, it is possible to use the distance between their locality-sensitive hashes [47], evaluated as (5):

d^{s,p} = |H^s - H^p|,  s = 1, 2, ..., S,  p = 1, 2, ..., s,   (5)

where H^s is the locality-sensitive hash of the s-th instance. Calculating hashes, unlike calculating distances, does not require loading all instances into computer memory: the hashes can be computed in a single pass over the sample, operating with fragments in memory.

3. Building a forest of T decision trees. To build the t-th model as a decision tree (t = 1, 2, ..., T):
– set the number of attributes for the t-th model m_t