=Paper=
{{Paper
|id=Vol-2277/paper27
|storemode=property
|title=
Various Machine Learning Methods Efficiency Comparison in Application to Inorganic Compounds Design
|pdfUrl=https://ceur-ws.org/Vol-2277/paper27.pdf
|volume=Vol-2277
|authors=Oleg Sen'ko,Nadezhda Kiselyova,Victor Dudarev,Alexander Dokukin,Vladimir Ryazanov
|dblpUrl=https://dblp.org/rec/conf/rcdl/SenkoKDDR18
}}
==
Various Machine Learning Methods Efficiency Comparison in Application to Inorganic Compounds Design
==
Various Machine Learning Methods Efficiency Comparison in Application to Inorganic Compounds Design

© O.V. Sen'ko 1, © N.N. Kiselyova 2, © V.A. Dudarev 2, © A.A. Dokukin 1, © V.V. Ryazanov 1

1 Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, Moscow, Russia
2 A.A. Baikov Institute of Metallurgy and Materials Science of the Russian Academy of Sciences, Moscow, Russia

senkoov@mail.ru, kis@imet.ac.ru, vic@imet.ac.ru, alex_dok@mail.ru, rvvccas@mail.ru

Abstract. The accuracy of various machine learning methods (the «Recognition» package and the «Scikit-learn» package for Python) was compared on inorganic chemistry tasks. Cross-validation and ROC analysis were applied to the accuracy estimation.

Keywords: machine learning methods comparison, pattern recognition, «Recognition», «Scikit-learn».

1 Introduction

Machine learning (ML) methods are widely used for predicting the formation of inorganic compounds and for estimating their properties [1-7]. The paper [5] contains a statistical analysis of the popularity of various ML methods applied in inorganic materials science. However, despite the success of these methods on numerous tasks in this subject field, no attempt has been made to compare the accuracy of a wide variety of methods using ROC analysis.

To solve this task, the particularities of the subject field must be taken into account. In particular, the attribute description obviously has a composite structure: the set of chemical element parameters (for the components of an inorganic substance) is repeated as many times as there are elements in the compound. Due to the periodic dependence of chemical element properties on atomic number, strong correlations are observed within the parameter set of each component, and the relative informativeness of an individual element property is low. For this reason, the properties of simpler compounds (e.g., simple oxides, halogenides, chalcogenides, etc.), as well as algebraic functions of the components' properties, are used. Although these parameters are very well studied, there are gaps in the property values (incomplete data). They are filled in a variety of ways: for example, using the periodic dependences of element parameters on atomic number together with appropriate interpolation and extrapolation. A large asymmetry of training sample sizes for the different classes is a further peculiarity of inorganic chemistry tasks. Very often the least representative classes (as a rule, newly discovered classes of substances) are the ones most interesting to chemists. Experimental errors and discrepancies in the classification of inorganic compounds in the training samples are yet another problem in compound design that drastically decreases prediction accuracy. Doubtless, accuracy depends on the informativeness of the attribute description and the representativeness of the training sample. Therefore, to evaluate the various ML methods we chose a number of tasks with highly reliable predictions (more than 85 % according to later experimental verification) [6, 7].

2 Prediction accuracy estimation methods

Cross-validation (CV) on the training sample is the most widely used universal and reliable tool for estimating machine learning quality, and the number of recognition errors can thereby be taken into account. However, one of the problems in ML accuracy estimation is determining recognition efficiency in the case of asymmetrical classes, where the numbers of objects in the different classes differ significantly. This situation is very common: often only a very few new materials with important practical properties have been obtained, and a search for not-yet-synthesized analogues of these substances allows a reduction of experimental research time and cost. In the majority of ML applications the standard decision rule minimizes the total number of erroneous predictions. This results in good recognition of compounds from the large class and bad recognition of substances from the small class. As a result, the overall recognition accuracy gives a poor notion of the efficiency of a given method or attribute description. Receiver Operating Characteristic (ROC) analysis is an alternative approach.
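The point about overall accuracy being misleading under class imbalance can be illustrated with scikit-learn (the package evaluated below). This is a minimal sketch on synthetic data; the classifier choice and class proportions are illustrative only, not taken from the paper:

```python
# On an imbalanced two-class sample, overall CV accuracy can look high even
# when the small class is recognized poorly; ROC AUC exposes the difference.
# Synthetic data and parameter choices here are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# roughly 85 % of objects in the large class, 15 % in the small one
X, y = make_classification(n_samples=900, n_features=20,
                           weights=[0.85, 0.15], random_state=0)

clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

acc = cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()
auc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()
print(f"10-fold CV accuracy: {acc:.3f}, ROC AUC: {auc:.3f}")
```

Note that a trivial rule always predicting the large class would already reach about 85 % accuracy here, which is why the AUC estimate is the more informative criterion.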
ROC analysis allows a comparison of recognition accuracy for the targeted and alternative classes under variation of the cut-offs that determine class membership.

The following prediction accuracy estimation procedure was used in this analysis. The available training sample is divided into two nonintersecting stratified subsamples, which are then used to train and assess the single and collective methods independently. Further, ROC analysis is carried out and the Area Under Curve (AUC) measure is calculated. As a rule, in collective decision making the methods with AUC above some fixed threshold value are used in prediction.

Proceedings of the XX International Conference "Data Analytics and Management in Data Intensive Domains" (DAMDID/RCDL'2018), Moscow, Russia, October 9-12, 2018.

3 The test tasks

3.1 Prediction of formation of compounds with the composition A2BCHal6 (A and C are various monovalent metals; B are trivalent metals; and Hal is F, Cl, or Br) [7].
2 classes:
1. formation of the compound – 744 examples;
2. nonformation of the compound – 170 examples.
137 attributes, including the 3 most informative algebraic functions of the initial attributes.

3.2 Prediction of formation and crystal structure type of compounds with composition A2BCHal6 [7].
4 classes:
1. elpasolites – 283 examples;
2. compounds with the Cs2NaCrF6 crystal structure type – 19 examples;
3. other crystal structure types – 57 examples;
4. nonformation of the compound – 83 examples.
134 attributes.

3.3 Prediction of formation of compounds with the composition ABHal3 (A are various monovalent metals; B are bivalent metals; Hal is F, Cl, Br, or I) [6].
2 classes:
1. formation of the compound – 237 examples;
2. nonformation of the compound – 107 examples.
88 attributes.

3.4 Prediction of formation and crystal structure type of compounds with composition ABHal3 [6].
6 classes:
1. perovskites – 46 examples;
2. compounds with the GdFeO3 crystal structure type – 20 examples;
3. compounds with the CsNiCl3 crystal structure type – 38 examples;
4. compounds with the NH4CdCl3 crystal structure type – 23 examples;
5. other crystal structure types – 39 examples;
6. nonformation of the compound – 111 examples.
88 attributes.

The most important attribute sets were selected using a program based on the method of [8].

4 The analysis of obtained results

Table 1 contains the efficiency estimation results for the single machine learning methods. The following algorithm notations are used («Recognition» package [9]):

• ECA – the estimates calculation algorithm (fixed size of support sets = 1), leave-one-out CV (LOOCV);
• DTA – the deadlock test algorithm (test searching algorithm – effective; divisor of ε-thresholds = 2; maximal size of sample = 20; number of subsamples of the same size = 3), LOOCV;
• SBT – the search for the best test (maximal number of ε-thresholds for one attribute = 5; maximal size of sample = 20; number of samples of the same size = 3; percent of tests used in recognition – 10 %; unitary weights), LOOCV;
• TLS – the two-dimensional linear separators method (bias step – 0; right part components – 0.1; number of iterations – 10000; number of start iteration – 100; percentage of removed objects – 1; step – 100; threshold of regularity selection – 80 %), 10-fold CV;
• BDT – binary decision tree learning (maximal number of interior nodes – 15; minimal significant value of entropy reduction – 0.2; minimal number of objects in leaf nodes – 5), LOOCV;
• LDF – the linear Fisher discriminant (confidence threshold for correlation coefficient – 0), LOOCV;
• LM – the linear machine method (bias step – 0; right part components – 0.1; number of iterations – 10000; number of start iteration – 100; percentage of excluded objects – 1; step – 100), LOOCV;
• LoReg – the voting algorithm in which class estimates are calculated by voting over a system of logical regularities ("greedy" way; number of intervals – 5; maximal number of iterations – 100000; beginning of removal – 100; percentage of removed inequalities – 1 %; removal step – 100; minimal rate of objects – 0.1; number of random permutations – 3), 10-fold CV;
• MNN – the multiplicative neural network algorithm (number of iterations – 1000), LOOCV;
• MP – the multilayer perceptron (number of hidden layers – 3; number of neurons in a layer – 10; number of training iterations – 3000; activation function – sigmoid; training speed – 0.1; moment of inertia – 0; if the criterion function does not increase during the last 1000 iterations, the training speed is halved), 10-fold CV;
• ANN – artificial neural network learning using back-propagation (number of hidden layers – 3; number of neurons in a layer – 10; number of training iterations – 500; activation function – sigmoid; training speed – 0.1; threshold – 0.1; if the criterion function does not increase during the last 100 iterations, the training speed is halved), 10-fold CV;
• KNN – the k-nearest neighbors method (number of nearest neighbors – 1; prior class probabilities are taken into account), LOOCV;
• SVM – the support vector machine (penalty coefficient – 5; kernel function type – Gaussian; kernel function parameter – 6; maximal number of iterations – 500), 10-fold CV;
• SWS – the statistical weighted syndromes (rapid mode; number of partition borders – 1; optimized criteria threshold – 4.5; representativeness threshold – 0.5; instability threshold – 0.2; denial zone – 0.1), 10-fold CV.

«Scikit-learn» package for Python [10], 10-fold CV:
• LIR – linear_model.LinearRegression;
• R – linear_model.Ridge;
• L – linear_model.Lasso;
• EN – linear_model.ElasticNet;
• LL – linear_model.LassoLars;
• OMP – linear_model.OrthogonalMatchingPursuit;
• BR – linear_model.BayesianRidge;
• HR – linear_model.HuberRegressor;
• KR – KernelRidge;
• PLS – PLSRegression;
• SGDC – linear_model.SGDClassifier;
• P – linear_model.Perceptron;
• PACH – the passive aggressive classifier (loss='hinge');
• PACS – the passive aggressive classifier (loss='squared_hinge');
• LSVC – linear SVC;
• NSVC1 – NuSVC (nu=0.1);
• NSVC3 – NuSVC (nu=0.3);
• LR – linear_model.LogisticRegression;
• GPC – the Gaussian process classifier;
• GNB – Gaussian naive Bayes;
• DTC – tree.DecisionTreeClassifier;
• KNN – KNeighborsClassifier (n_neighbors=5);
• MP – neural_network.MLPClassifier;
• BC – ensemble.BaggingClassifier;
• RFC – ensemble.RandomForestClassifier;
• ETC – ensemble.ExtraTreesClassifier;
• ABC – ensemble.AdaBoostClassifier;
• GBC – ensemble.GradientBoostingClassifier.

Table 1. The accuracy estimation of various single machine learning methods (columns: Algorithm, CV accuracy %, AUC)

System «Recognition» – Task 1:
SVM 89.8 0.916
LM 90.7 0.884
ANN 89.1 0.880
SWS 82.3 0.872
LoReg 87.8 0.877
TLS 84.7 0.863
DTA 84.0 0.861
MNN 87.1 0.827
MP 84.5 0.816
KNN 87.6 0.805
ECA 83.6 0.799
LDF 86.0 0.754
SBT 85.6 0.745
BDT 81.4 0

System «Recognition» – Task 2:
DTA 61.5 0.864
SVM 71.8 0.842
SWS 58.2 0.780
LoReg 68.1 0.776
ANN 67.1 0.766
KNN 70.0 0.751
LM 65.3 0.734
MNN 66.7 0.694
TLS 64.8 0.675
LDF 71.4 0.671
MP 66.7 0.666
SBT 60.6 0.657
ECA 70.4 0.653
BDT 71.8 0.251

System «Recognition» – Task 3:
SVM 77.2 0.845
TLS 75.0 0.822
ECA 81.1 0.816
LM 77.8 0.804
DTA 78.9 0.801
LoReg 77.2 0.799
SWS 73.3 0.788
MNN 73.3 0.772
ANN 75.0 0.767
SBT 78.3 0.737
MP 72.8 0.733
KNN 71.7 0.700
LDF 71.7 0.675
BDT 78.9 0.607

System «Recognition» – Task 4:
DTA 59.4 0.865
LM 62.9 0.857
ANN 71.3 0.850
SWS 47.6 0.847
LoReg 64.3 0.843
SBT 56.6 0.836
SVM 67.1 0.832
BDT 59.4 0.803
ECA 60.1 0.780
LDF 50.3 0.756
KNN 62.9 0.742
MP 49.7 0.725
MNN 48.3 0.684

Scikit-learn in Python [10] – Task 1:
GBC 93.3 0.959
BC 92.2 0.951
ETC 92.0 0.948
RFC 92.2 0.945
MP 92.0 0.935
NSVC1 93.3 0.930
ABC 91.6 0.927
NSVC3 89.6 0.911
LIR 89.8 0.907
R 89.6 0.905
KR 77.1 0.905
LSVC 89.2 0.902
GPC 90.7 0.900
BR 88.3 0.895
LR 89.0 0.895
OMP 88.7 0.886
KNN 89.8 0.880
HR 82.5 0.850
PACH 83.3 0.834
PACS 83.3 0.834
SGDC 82.5 0.828
PLS 81.8 0.815
P 83.3 0.812
DTC 88.1 0.806
GNB 78.1 0.796
EN 81.4 0.5
LL 81.4 0.5
L 81.4 0.5

Scikit-learn in Python [10] – Task 3:
RFC 85.4 0.935
MP 89.0 0.935
GBC 85.4 0.931
NSVC3 85.4 0.925
HR 85.4 0.923
P 89.6 0.917
PACH 85.4 0.917
PACS 85.4 0.917
BC 86.6 0.916
LIR 86.6 0.916
NSVC1 84.1 0.916
R 87.8 0.913
KR 81.7 0.913
LR 87.8 0.913
ETC 85.4 0.912
BR 87.8 0.911
SGDC 86.0 0.910
PLS 86.0 0.905
LSVC 86.6 0.905
OMP 88.4 0.899
KNN 82.3 0.881
GPC 82.9 0.860
ABC 83.5 0.856
GNB 76.2 0.831
DTC 84.1 0.813
L 69.5 0.5
EN 69.5 0.5
LL 69.5 0.5

Table 2 includes the efficiency estimation results for the algorithm ensemble methods. The following algorithm notations are used («Recognition» package [9]):

• AC – the algebraic corrector (quadratic merit functional; minimal mean deviation = 0);
• CS – the convex stabilizer (function type – Gaussian);
• WD – the Woods dynamic method (number of objects in vicinity = 10);
• CCA – the complex committee method (averaging);
• CCM – the complex committee method (majority voting);
• BM – the Bayes method;
• CAS – the clustering and selection (number of clusters = 3);
• LC – the logic corrector;
• GPC – the generalized polynomial corrector (minimal mean deviation = 0);
• DC – the domains of competence (number of domains of competence = 3);
• DT – the decision templates.

«Scikit-learn» package for Python [10]:
• VCS – ensemble.VotingClassifier (voting='soft');
• VCH – ensemble.VotingClassifier (voting='hard').

Table 2. The accuracy estimation of various collective methods (columns: Algorithm, CV accuracy %, AUC)

System «Recognition» – Task 1:
CCA 91.8 0.920
LC 90.3 0.918
GPC 88.3 0.896
CCM 87.4 0.893
BM 86.8 0.885
DC 92.9 0.847
DT 92.0 0.796
AC 91.6 0.770
WD 82.3 0.719

System «Recognition» – Task 2:
CCA 81.2 0.906
GPC 80.8 0.904
LC 72.1 0.893
DC 79.5 0.874
CCM 75.5 0.864
BM 79.0 0.812
WD 62.0 0.742
DT 80.8 0.727
AC 55.0 0.711

System «Recognition» – Task 3:
CCA 87.2 0.906
GPC 87.2 0.904
LC 86.0 0.893
DC 87.2 0.874
CCM 87.2 0.864
BM 86.6 0.812
WD 81.1 0.742
DT 85.4 0.727
AC 82.3 0.711

System «Recognition» – Task 4:
LC 50.7 0.847
BM 55.2 0.840
WD 55.2 0.827
CCA 61.2 0.815
CCM 63.4 0.787
DT 59.0 0.745
DC 60.4 0.651
GPC 59.7 0.646
AC 52.2 0.646

Scikit-learn in Python [10] – Task 1:
VCS 94.4 0.889
VCH 93.7 0.867

Scikit-learn in Python [10] – Task 3:
VCS 87.2 0.852
VCH 86.6 0.836

The collective decision-making methods use the algorithms whose AUC values are marked in boldface type (see Table 1). The «default option» mode was used for choosing the algorithm parameter values.

It should be noted that in most cases the choice of the most accurate single ML methods by cross-validation and by ROC analysis coincides. The best algorithms according to AUC value (see Table 1) are those based on the support vector machine (SVM), the deadlock test (DTA), and artificial neural network learning (ANN), as well as the linear machine (LM), the statistical weighted syndromes (SWS), and the two-dimensional linear separators (TLS). Gradient Boosting (GBC) tops the list for the Scikit-learn package. The worst algorithms are binary decision tree learning (BDT), the search for the best test (SBT), the linear Fisher discriminant (LDF), the Elastic Net (EN), and the Lasso (L and LL).

The most efficient algorithm ensembles (see Table 2) are the complex committee method with averaging (CCA), the logic corrector (LC), the generalized polynomial corrector (GPC), and the voting (VC). In most cases the application of algorithm ensembles allows an increase in prediction accuracy.

5 Conclusions

The selection of the most accurate algorithms is among the most important tasks of ML. To solve this task, the peculiarities of the subject field must be taken into account. In this research, ML software from the «Recognition» and «Scikit-learn» packages was tested on inorganic compound prediction tasks. As a rule, the small sizes of the training samples in these tasks do not allow the selection of a representative subset of objects for examinational recognition. In that context, cross-validation on the training sample is the most acceptable procedure for estimating the accuracy of ML algorithms. A substantial difference in the numbers of objects of the different classes is a peculiarity of inorganic chemistry tasks. Therefore, ROC analysis is the most acceptable method for evaluating the accuracy of these algorithms.

Acknowledgments. This work was partially supported by the Russian Foundation for Basic Research (project nos. 17-07-01362 and 18-07-00080) and State assignments No. 007-00129-18-00 and 0063-2020-0003.

References

[1] N.N. Kiselyova. Komp'yuternoe konstruirovanie neorganicheskikh soedinenii. Ispol'zovanie baz dannykh i metodov iskusstvennogo intellekta (Computer Design of Inorganic Compounds: Use of Databases and Artificial Intelligence Methods). Moscow: Nauka, 2005.
[2] N.Y. Chen, W.C. Lu, J. Yang, G.Z. Li. Support Vector Machine in Chemistry. Singapore: World Scientific Publishing Co. Pte. Ltd., 2004.
[3] N.N. Kiselyova. Computer design of materials with artificial intelligence methods. In: Intermetallic Compounds. Principles and Practice, Vol. 3, Westbrook, J.H. & Fleischer, R.L., eds., p. 811-839. Chichester, UK: John Wiley & Sons, Ltd., 2002.
[4] T. Mueller, A.G. Kusne, R. Ramprasad. Machine Learning in Materials Science: Recent Progress and Emerging Applications. Reviews in Computational Chemistry, 29, p. 186-273, 2016.
[5] N.N. Kiselyova, A.V. Stolyarenko, V.A. Dudarev. Machine Learning Methods Application to Search for Regularities in Chemical Data. Selected Papers of the XIX International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2017), Moscow, Russia, October 9-13, 2017. CEUR Workshop Proceedings, v. 2022, p. 375-380, 2017. http://ceur-ws.org/Vol-2022/paper57.pdf
[6] N.N. Kiseleva. Prediction of the new compounds in the systems of halogenides of the univalent and bivalent metals. Russian Journal of Inorganic Chemistry, 59(5), p. 496-502, 2014.
[7] N.N. Kiselyova, A.V. Stolyarenko, V.V. Ryazanov, O.V. Sen'ko, A.A. Dokukin. Prediction of New Halo-Elpasolites. Russian Journal of Inorganic Chemistry, 61(5), p. 604-609, 2016.
[8] O.V. Senko. An Optimal Ensemble of Predictors in Convex Correcting Procedures. Pattern Recognition and Image Analysis, 19(3), p. 465-468, 2009.
[9] Yu.I. Zhuravlev, V.V. Ryazanov, O.V. Sen'ko. RECOGNITION: Mathematical Methods. Software System. Practical Solutions. Moscow: Phasis, 2006.
[10] F. Pedregosa et al. Scikit-learn: Machine Learning in Python. JMLR, 12, p. 2825-2830, 2011.
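Appendix-style note: the VCS and VCH notations above correspond to scikit-learn's VotingClassifier with soft and hard voting. A minimal sketch of that ensemble scheme, assuming synthetic data and an illustrative set of base estimators (not the ones used in the paper):

```python
# Sketch of the VCS / VCH ensembles: VotingClassifier with voting='soft'
# (averages predicted class probabilities) and voting='hard' (majority vote
# on predicted labels). Data and base estimators are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

base = [("gbc", GradientBoostingClassifier(random_state=0)),
        ("rfc", RandomForestClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000))]

vcs = VotingClassifier(estimators=base, voting="soft")
vch = VotingClassifier(estimators=base, voting="hard")

acc_soft = cross_val_score(vcs, X, y, cv=10, scoring="accuracy").mean()
acc_hard = cross_val_score(vch, X, y, cv=10, scoring="accuracy").mean()
print(f"soft voting: {acc_soft:.3f}, hard voting: {acc_hard:.3f}")
```

Soft voting requires every base estimator to expose predicted probabilities (predict_proba), which is why only such classifiers are combined here.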