Data Science

FEATURE SELECTION IN THE EFFECTIVENESS RESEARCH OF A TRAINING PROGRAM FOR PATIENTS WITH THE ATRIAL FIBRILLATION

V.V. Kutikova1, A.V. Gaidel1,2, A.G. Khramov1,2

1 Samara National Research University, Samara, Russia
2 Image Processing Systems Institute, Russian Academy of Sciences, Samara, Russia

Abstract. We investigated the effectiveness of the training therapeutic program of the school “Stop a Stroke”, which aims to reduce the risk of stroke in patients with atrial fibrillation. Using two feature selection methods that rely on a discriminant analysis criterion to determine the best feature subset, we concluded that patients who were trained at the school, in contrast to patients who did not receive training, have been taking anticoagulants for a longer time and have a higher level of knowledge about atrial fibrillation.

Keywords: data mining, feature selection, discriminant analysis criterion.

Citation: Kutikova VV, Gaidel AV, Khramov AG. Feature selection in the effectiveness research of a training program for patients with the atrial fibrillation. CEUR Workshop Proceedings, 2016; 1638: 902-908. DOI: 10.18287/1613-0073-2016-1638-902-908

1 Introduction

Reducing the dimensionality of a feature space is one of the central issues in data mining. For most classification and regression problems it is necessary to select the best subset from a given feature set. The reason is that using a large number of features is not only computationally expensive but can also reduce the recognition accuracy, because irrelevant and redundant features, which complicate the decision-making process, may be included.

Feature selection methods are commonly used for biomedical data analysis. In [1], an ANCOVA-based method was used to identify an 80-gene biomarker of lung cancer in smokers from 22216 features describing the expression levels of different genes. The accuracy, sensitivity and specificity of this biomarker were 83%, 80% and 84%, respectively.
To select a small number of features, one can use a brute-force search [2]. Sequential search algorithms [3] and genetic algorithms [4] are also used in biomedical data mining. Feature selection methods based on a discriminant analysis criterion have proved effective for the analysis of biomedical images [5, 6].

Information Technology and Nanotechnology (ITNT-2016)

The aim of this work is to study the effectiveness of the training therapeutic program of the school “Stop a Stroke”, which aims to reduce the risk of stroke in patients with atrial fibrillation. The dataset contains observations of 12 features, and the classes of the observations are the patient groups: a main group and a comparison group. Patients of the main group were trained at the school, whereas patients of the comparison group visited a doctor but did not receive training. The data were obtained during two visits of the patients to the doctor: before the training course (the first visit) and after it (the second visit).

To evaluate the effectiveness of the training course, we first select, on the basis of the data from the second visit, the feature subset that best separates the original data into the two groups; then we compare the performance of the obtained feature subset for the two visits; finally, we draw conclusions about the effectiveness of the training course.

Two feature selection methods are used as the main research tools. The first method estimates the quality of separate features using the discriminant analysis criterion, and the second method estimates the quality of feature subsets using the same criterion.
2 Methods

2.1 Feature ordering in correspondence with the discriminant analysis criterion

Following the method described in [4], features are ordered in correspondence with the discriminant analysis criterion [7]

$$J = \frac{\operatorname{tr} S_m}{\operatorname{tr} S_w}, \qquad (1)$$

where $\operatorname{tr} S_w$ is the trace of the within-class scatter matrix and $\operatorname{tr} S_m$ is the trace of the mixture scatter matrix.

The within-class scatter matrix shows the scatter of samples around their respective class expected vectors:

$$S_w = \sum_{i=1}^{2} p_i \operatorname{M}\left\{ \left(X^{(i)} - M_i\right)\left(X^{(i)} - M_i\right)^T \right\},$$

where $\operatorname{M}\{\cdot\}$ denotes the expectation, $p_i$ is the prior probability of the $i$-th class, $X^{(i)}$ is a feature vector from the $i$-th class, and $M_i$ is the expected vector of the $i$-th class.

The mixture scatter matrix is the covariance matrix of all feature vectors regardless of their class:

$$S_m = \operatorname{M}\left\{ (X - M_0)(X - M_0)^T \right\},$$

where $M_0 = p_1 M_1 + p_2 M_2$ is the expected vector of the mixture distribution.

The higher the value of criterion (1), the better a feature distinguishes the samples from different classes.

2.2 Sequential search of the best feature subset

The approach described above assesses the performance of each feature separately but does not take into account dependencies between the features. Some features can be useless by themselves but effective when combined with other features.

A sequential algorithm searches for the best features in the space of feature subsets. The basic idea is that, starting from some initial subset, at each step we move to the next state, in which one element is either included into the current feature subset or excluded from it.

Let $F$ be the set of all features that participate in the selection, $X$ be the current best feature subset of $F$, and $Y$ be the subset of the remaining features, that is, $Y = F \setminus X$. The sequential algorithm consists of the following steps:

1. The choice of an evaluation function to measure the performance of a feature subset (in this work, the criterion (1)), a stopping criterion (the number of features in the current set $X$ equals the number of features in $F$) and an initial subset (here, the empty set).

2. The choice of “the search direction”. We step forward when we search for the best subset among the new subsets formed by including one feature in the current best subset:

$$X = X \cup \{y\}, \quad \text{where } y = \arg\max_{c \in Y} J(X \cup \{c\}).$$

We step back when we search for the best subset among the subsets formed by excluding one feature from the current best subset:

$$X = X \setminus \{x\}, \quad \text{where } x = \arg\max_{c \in X} J(X \setminus \{c\}).$$

3. If the stopping criterion is fulfilled after the next subset has been found, the search stops; otherwise we return to step 2.

In this paper we use the “two steps forward, one step back” approach. Among the subsets of the same dimension visited during the search, the one with the highest value of the criterion (1) is chosen as the best feature subset of that dimension.

2.3 Evaluation of the training program effectiveness

Let $J_0$ be the value of the criterion (1) for some feature subset obtained from the data of the first visit to the doctor, and $J_1$ be the criterion value calculated from the data of the second visit. The effectiveness of the training program for this feature subset is calculated as follows:

$$E = \frac{J_1}{J_0}. \qquad (2)$$

The school has a “positive” effect on the feature subsets for which $E > 1$, a “negative” effect on the subsets for which $E < 1$, and no effect on the subsets with $E = 1$.

A conclusion about the effectiveness of the training program is drawn from the obtained effectiveness values and from the within-group mean values of the features most affected by the school.
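As an illustration only, the criterion (1), the sequential search of section 2.2 and the effectiveness (2) can be sketched in Python as follows. This is not the authors' implementation. The function names are hypothetical, the priors $p_i$ are assumed to be estimated by class frequencies, the scatter matrices by biased sample variances, and the data are synthetic. The reading of “two steps forward, one step back” as a repeated cycle is also an assumption. With these estimators $J \ge 1$ always holds, so the values need not match those reported in the tables.

```python
import numpy as np


def criterion_j(X, y):
    """Criterion (1): J = tr(S_m) / tr(S_w) for samples X and class labels y."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:                      # a single feature given as a vector
        X = X[:, None]
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / counts.sum()       # assumption: p_i = class frequency
    # tr S_w: prior-weighted sum of the per-feature within-class variances
    tr_sw = sum(p * X[y == c].var(axis=0).sum() for c, p in zip(classes, priors))
    # M_0 = sum_i p_i M_i, the expected vector of the mixture
    m0 = sum(p * X[y == c].mean(axis=0) for c, p in zip(classes, priors))
    # tr S_m: total scatter of all samples around M_0
    tr_sm = ((X - m0) ** 2).mean(axis=0).sum()
    return tr_sm / tr_sw


def two_forward_one_back(X, y, score=criterion_j):
    """Return {size: (J, subset)} with the best visited subset of each size."""
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    current, best = [], {}

    def evaluate(subset):
        return score(X[:, subset], y)

    def record(subset):
        value, size = evaluate(subset), len(subset)
        if size not in best or value > best[size][0]:
            best[size] = (value, tuple(sorted(subset)))

    while len(current) < n_features:          # stopping criterion: X equals F
        for _ in range(2):                    # two steps forward
            if len(current) == n_features:
                break
            rest = [c for c in range(n_features) if c not in current]
            current = current + [max(rest, key=lambda c: evaluate(current + [c]))]
            record(current)
        if len(current) < n_features:         # one step back
            drop = max(current, key=lambda c: evaluate([f for f in current if f != c]))
            current = [f for f in current if f != drop]
            record(current)
    return best


def effectiveness(X_visit1, X_visit2, y, subset):
    """Formula (2): E = J_1 / J_0 for the given feature subset."""
    subset = list(subset)
    j0 = criterion_j(np.asarray(X_visit1)[:, subset], y)
    j1 = criterion_j(np.asarray(X_visit2)[:, subset], y)
    return j1 / j0


# Synthetic stand-in for the patient data (not the data of the study).
rng = np.random.default_rng(0)
groups = np.array([1] * 36 + [0] * 33)        # 1 = main group, 0 = comparison group
visit1 = rng.normal(size=(69, 4))
visit2 = rng.normal(size=(69, 4))
visit2[groups == 1, 0] += 3.0                 # only feature 0 changes in the main group

best_subsets = two_forward_one_back(visit2, groups)
for size, (j1, subset) in sorted(best_subsets.items()):
    e = effectiveness(visit1, visit2, groups, subset)
    print(size, subset, round(j1, 2), round(e, 2))
```

Following the procedure of the paper, the subsets are selected on the second-visit data, and $E$ is then computed for each selected subset from both visits. In this synthetic example only feature 0 is shifted in the main group, so the subsets containing it are expected to show $E > 1$.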
3 Experimental results

The study of the training program effectiveness was carried out on the basis of 12 features, namely: answers to a questionnaire, which was filled in by the patients before and after the training course, blood pressure (systolic, diastolic), and hemostasis parameters (prothrombin time, prothrombin, fibrinogen, partial thromboplastin time). The dataset included 69 observations (36 from the main group and 33 from the comparison group) for each of the two visits.

3.1 Effectiveness of the training course for separate features

Table 1 shows the evaluation results of the training program effectiveness for separate features. For each feature it gives the value of the criterion (1) obtained from the data of the first visit ($J_0$) and of the second visit ($J_1$), as well as the effectiveness value $E$ calculated according to formula (2). The features are arranged in descending order of the criterion value $J_1$. Table 2 shows the within-group mean values of the features presented in Table 1.

According to Table 1, after the training course (the second visit) features 1, 2 and 4 distinguish the patient groups considerably better than at the first visit; that is, the effectiveness values of these features are large. In addition, as shown in Table 2, the mean values of these features within the main group increased from the first visit to the second and changed only slightly within the comparison group. Hence, the training therapeutic program is effective for features 1, 2 and 4.

One can also notice that for feature 3 the value $J_0$ is greater than the value $J_1$; that is, this feature distinguished the patient groups better before the training course than after it. This effect is explained by the fact that the within-class mean values improved slightly from the first visit to the second (the patients began to realize the importance of drug intake) while the within-class variance increased.

Table 1. Effectiveness of the training course for separate features

| № | Feature | $J_1$ | $J_0$ | $E$ |
|---|---|---|---|---|
| 1 | Anticoagulant therapy (0 – don’t take, 1 – take less than a year, 2 – from 1 to 5 years, 3 – more than 5 years) | 9.35 | 0.97 | 9.64 |
| 2 | How do you assess your level of knowledge about the atrial fibrillation? (1 – low, 5 – high) | 5.21 | 0.98 | 5.33 |
| 3 | How important is it to regularly take a drug for stroke prevention in accordance with a prescription? (1 – not important, 5 – very important) | 3.07 | 3.52 | 0.87 |
| 4 | How do you assess your knowledge about the risk of stroke as the main complication of the atrial fibrillation? (1 – low, 5 – high) | 2.58 | 1.03 | 2.50 |
| 5 | Prothrombin (per cent) | 1.19 | 1.08 | 1.10 |
| 6 | Systolic blood pressure (mm Hg) | 1.13 | 0.99 | 1.14 |
| 7 | Prothrombin time (seconds) | 1.13 | 0.99 | 1.14 |
| 8 | Partial thromboplastin time (seconds) | 1.07 | 1.03 | 1.03 |
| 9 | Fibrinogen (g/l) | 1.02 | 1.00 | 1.02 |
| 10 | Aspirin (1 – take, 0 – don’t take) | 1.02 | 1.02 | 1.00 |
| 11 | Diastolic blood pressure (mm Hg) | 0.99 | 0.99 | 1.00 |
| 12 | How much has the atrial fibrillation changed your daily life? (1 – hasn’t changed, 5 – has changed greatly) | 0.99 | 0.99 | 1.00 |

Table 2. Within-group mean values of features

| № | Main group, visit 1 | Main group, visit 2 | Comparison group, visit 1 | Comparison group, visit 2 |
|---|---|---|---|---|
| 1 | 0.06 | 2.06 | 0.12 | 0.12 |
| 2 | 1.47 | 4.31 | 1.55 | 1.48 |
| 3 | 4.08 | 4.15 | 1.39 | 1.61 |
| 4 | 1.49 | 4.28 | 1.89 | 2.08 |
| 5 | 88.58 | 84.02 | 97.34 | 95.18 |
| 6 | 166.94 | 140.78 | 168.48 | 144.3 |
| 7 | 13.13 | 13.78 | 13.06 | 12.98 |
| 8 | 31.70 | 31.90 | 30.19 | 30.01 |
| 9 | 4.66 | 4.26 | 4.58 | 4.37 |
| 10 | 0.94 | 0.94 | 0.79 | 0.79 |
| 11 | 98.17 | 87.47 | 97.70 | 88.24 |
| 12 | 3.02 | 3.56 | 3.06 | 3.30 |

3.2 Effectiveness of the training course for feature subsets

Table 3 shows the evaluation results of the training program effectiveness for the subsets obtained with the sequential feature selection method presented in section 2.2. For the first 10 best subsets of the features from Table 1, Table 3 gives the values $J_0$, $J_1$ and $E$.
One can see that all 10 subsets distinguish the patient groups better after the training course than before it. However, the first 3 subsets, which include features 1, 2 and 10, have much larger effectiveness values $E$ than the remaining subsets. In addition, the mean values of features 1 and 2 within the main group increased from the first visit to the second and did not change significantly within the comparison group. Taking these facts into account, we can conclude that the therapeutic training program is effective for features 1 and 2.

Table 3. Effectiveness of the training course for feature subsets

| Features | $J_1$ | $J_0$ | $E$ |
|---|---|---|---|
| 1 | 9.35 | 0.97 | 9.64 |
| 1, 2 | 6.01 | 0.98 | 6.14 |
| 1, 2, 10 | 5.20 | 0.98 | 5.29 |
| 1, 2, 3, 10 | 4.08 | 2.16 | 1.89 |
| 1, 2, 3, 9, 10 | 3.59 | 1.95 | 1.84 |
| 1, 2, 3, 4, 9, 10 | 3.29 | 1.72 | 1.91 |
| 1, 2, 3, 4, 7, 9, 10 | 2.63 | 1.46 | 1.81 |
| 1, 2, 3, 4, 7, 9, 10, 12 | 2.18 | 1.32 | 1.66 |
| 1, 2, 3, 4, 7, 8, 9, 10, 12 | 1.42 | 1.10 | 1.29 |
| 1, 2, 3, 4, 7, 8, 9, 10, 11, 12 | 1.24 | 1.05 | 1.18 |

4 Conclusion

This paper presents the results of a study of the effectiveness of the training therapeutic program of the school “Stop a Stroke”. Using two feature selection methods, the feature subsets that best distinguish patients of the main and comparison groups were found on the basis of the patient data obtained after the training course. Among these best subsets, the feature subsets with large effectiveness values were selected. Taking into account the within-class mean values, we concluded that, in general, the program turned out to be effective.

In particular, the study of the training program effectiveness for separate features has shown that training at the school “Stop a Stroke” is effective for the features “Anticoagulant therapy” (E = 9.64), “How do you assess your level of knowledge about the atrial fibrillation?” (E = 5.33) and “How do you assess your knowledge about the risk of stroke as the main complication of the atrial fibrillation?” (E = 2.50).
Evaluating the effectiveness of the training course for feature subsets, we obtained similar results. In that case the training program of the school turned out to be effective for the pair of features “Anticoagulant therapy” and “How do you assess your level of knowledge about the atrial fibrillation?” (E = 6.14). In other words, patients who were trained at the school, in contrast to patients who did not receive training, have been taking anticoagulants for a longer time and have a higher level of knowledge about atrial fibrillation and about the risk of stroke as its main complication.

Acknowledgements

The work was partially funded by Russian Science Foundation (RSF), grant No. 14-07-97040-р_поволжье_а, the Russian Federation Ministry of Education and Science and Fundamental Research Program NITD RAS «Bioinformatics, modern information technologies and mathematical methods in medicine».

References

1. Spira A, Beane JE, Shah V, Steiling K, Liu G, Schembri F, Gilman S, Dumas Y-M, Calner P, Sebastiani P, Sridhar S, Beamis J, Lamb C, Anderson T, Gerry N, Keane J, Lenburg ME, Brody JS. Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer. Nature Medicine, 2007; 13(3): 361-366.
2. Ilyasova NY, Kupriyanov AV, Paringer RA. Formation of features for improving the quality of medical diagnosis based on discriminant analysis methods. Computer Optics, 2014; 38(4): 851-855. [In Russian]
3. Peng Y, Wu Z, Jiang J. A novel feature selection approach for biomedical data classification. Journal of Biomedical Informatics, 2010; 43: 15-23.
4. Tsai C-F, Eberle W, Chu CY. Genetic algorithms in feature and instance selection. Knowledge-Based Systems, 2013; 39: 240-247.
5. Gaidel AV, Pervushkin SS. Research of the textural features for the bony tissue diseases diagnostics using the roentgenograms. Computer Optics, 2013; 37(1): 113-119. [In Russian]
6. Kutikova VV, Gaidel AV. Study of informative feature selection approaches for the texture image recognition problem using the Laws’ masks. Computer Optics, 2015; 39(5): 744-750. DOI: 10.18287/0134-2452-2015-39-5-744-750.
7. Fukunaga K. Introduction to statistical pattern recognition. San Diego: Academic Press, 1990.