=Paper=
{{Paper
|id=Vol-2277/paper27
|storemode=property
|title=
Various Machine Learning Methods Efficiency Comparison in Application to Inorganic Compounds Design
|pdfUrl=https://ceur-ws.org/Vol-2277/paper27.pdf
|volume=Vol-2277
|authors=Oleg Sen'ko,Nadezhda Kiselyova,Victor Dudarev,Alexander Dokukin,Vladimir Ryazanov
|dblpUrl=https://dblp.org/rec/conf/rcdl/SenkoKDDR18
}}
==
Various Machine Learning Methods Efficiency Comparison in Application to Inorganic Compounds Design
==
Various Machine Learning Methods Efficiency Comparison in Application to Inorganic Compounds Design

© O.V. Sen'ko 1, © N.N. Kiselyova 2, © V.A. Dudarev 2, © A.A. Dokukin 1, © V.V. Ryazanov 1

1 Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, Moscow, Russia
2 A.A. Baikov Institute of Metallurgy and Materials Science of the Russian Academy of Sciences, Moscow, Russia

senkoov@mail.ru, kis@imet.ac.ru, vic@imet.ac.ru, alex_dok@mail.ru, rvvccas@mail.ru

Abstract. The accuracy of various machine learning methods (the «Recognition» package and the «Scikit-learn» package for Python) was compared on inorganic chemistry tasks. Cross-validation and ROC analysis were applied to the accuracy estimation.

Keywords: machine learning methods comparison, pattern recognition, «Recognition», «Scikit-learn».

1 Introduction

Machine learning (ML) methods are widely used for predicting the formation of inorganic compounds and for estimating their properties [1-7]. The paper [5] contains a statistical analysis of the popularity of various ML methods applied in inorganic materials science. However, despite the success of these methods on numerous tasks in this subject field, no attempt has been made to compare the accuracy of a wide variety of methods using ROC analysis.

To solve this task, the particularities of the subject field must be taken into account. In particular, the attribute description obviously has a composite structure: the set of chemical element parameters (for the components of an inorganic substance) is repeated as many times as there are elements in the compound. Due to the periodic dependence of chemical element properties on atomic number, strong correlations are observed within the parameter set of each component, and the relative informativeness of an individual element property is low. For this reason, the properties of simpler compounds (e.g., simple oxides, halogenides, chalcogenides, etc.), as well as algebraic functions of the components' properties, are used. Although these parameters are very well studied, there are gaps in the property values (incomplete data). They are filled in a variety of ways: for example, using the periodic dependences of element parameters on atomic number together with appropriate interpolation and extrapolation. A large asymmetry of training sample sizes for the different classes is a further peculiarity of inorganic chemistry tasks. Very often the least representative classes (as a rule, newly discovered classes of substances) are the ones most interesting to chemists. Experimental errors and discrepancies in the classification of inorganic compounds in the training samples are yet another problem in compound design that drastically decreases prediction accuracy. Doubtless, accuracy depends on the informativeness of the attribute description and the representativeness of the training sample. Therefore, to evaluate the various ML methods we chose a number of tasks with highly reliable predictions (more than 85 % according to later experimental verification) [6, 7].

2 Prediction accuracy estimation methods

Cross-validation (CV) on the training sample is the most widely used universal and reliable tool for estimating machine learning quality, and the number of recognition errors can thereby be taken into account. However, one of the problems in ML accuracy estimation is determining recognition efficiency in the case of asymmetrical classes, where the numbers of objects in the different classes differ significantly. This situation is very common: often only a very few new materials with important practical properties have been obtained, and a search for not-yet-synthesized analogues of these substances allows a reduction of experimental research time and cost. In the majority of ML applications the standard decision rule minimizes the total number of erroneous predictions. This results in good recognition of compounds from the large class and bad recognition of substances from the small class. As a result, the overall recognition accuracy gives a poor notion of the efficiency of a given method or attribute description. Receiver Operating Characteristic (ROC) analysis is an alternative approach.
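The point about overall accuracy being misleading under class imbalance can be illustrated with scikit-learn (the package evaluated below). This is a minimal sketch on synthetic data; the classifier choice and class proportions are illustrative only, not taken from the paper:

```python
# On an imbalanced two-class sample, overall CV accuracy can look high even
# when the small class is recognized poorly; ROC AUC exposes the difference.
# Synthetic data and parameter choices here are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# roughly 85 % of objects in the large class, 15 % in the small one
X, y = make_classification(n_samples=900, n_features=20,
                           weights=[0.85, 0.15], random_state=0)

clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

acc = cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()
auc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()
print(f"10-fold CV accuracy: {acc:.3f}, ROC AUC: {auc:.3f}")
```

Note that a trivial rule always predicting the large class would already reach about 85 % accuracy here, which is why the AUC estimate is the more informative criterion.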
ROC analysis allows a comparison of recognition accuracy for the targeted and alternative classes under variation of the cut-offs that determine class membership.

The following prediction accuracy estimation procedure was used in this analysis. The available training sample is divided into two nonintersecting stratified subsamples, which are then used to train and assess the single and collective methods independently. Further, ROC analysis is carried out and the Area Under Curve (AUC) measure is calculated. As a rule, in collective decision making the methods with AUC above some fixed threshold value are used in prediction.

Proceedings of the XX International Conference "Data Analytics and Management in Data Intensive Domains" (DAMDID/RCDL'2018), Moscow, Russia, October 9-12, 2018.

3 The test tasks

3.1 Prediction of formation of compounds with the composition A2BCHal6 (A and C are various monovalent metals; B are trivalent metals; and Hal is F, Cl, or Br) [7].
2 classes:
1. formation of the compound – 744 examples;
2. nonformation of the compound – 170 examples.
137 attributes, including the 3 most informative algebraic functions of the initial attributes.

3.2 Prediction of formation and crystal structure type of compounds with composition A2BCHal6 [7].
4 classes:
1. elpasolites – 283 examples;
2. compounds with the Cs2NaCrF6 crystal structure type – 19 examples;
3. other crystal structure types – 57 examples;
4. nonformation of the compound – 83 examples.
134 attributes.

3.3 Prediction of formation of compounds with the composition ABHal3 (A are various monovalent metals; B are bivalent metals; Hal is F, Cl, Br, or I) [6].
2 classes:
1. formation of the compound – 237 examples;
2. nonformation of the compound – 107 examples.
88 attributes.

3.4 Prediction of formation and crystal structure type of compounds with composition ABHal3 [6].
6 classes:
1. perovskites – 46 examples;
2. compounds with the GdFeO3 crystal structure type – 20 examples;
3. compounds with the CsNiCl3 crystal structure type – 38 examples;
4. compounds with the NH4CdCl3 crystal structure type – 23 examples;
5. other crystal structure types – 39 examples;
6. nonformation of the compound – 111 examples.
88 attributes.

The most important attribute sets were selected using a program based on the method of [8].

4 The analysis of obtained results

Table 1 contains the efficiency estimation results for the single machine learning methods. The following algorithm notations are used («Recognition» package [9]):

• ECA – the estimates calculation algorithm (fixed size of support sets = 1), leave-one-out CV (LOOCV);
• DTA – the deadlock test algorithm (test searching algorithm – effective; divisor of ε-thresholds = 2; maximal size of sample = 20; number of subsamples of the same size = 3), LOOCV;
• SBT – the search for the best test (maximal number of ε-thresholds for one attribute = 5; maximal size of sample = 20; number of samples of the same size = 3; percent of tests used in recognition – 10 %; unitary weights), LOOCV;
• TLS – the two-dimensional linear separators method (bias step – 0; right part components – 0.1; number of iterations – 10000; number of start iteration – 100; percentage of removed objects – 1; step – 100; threshold of regularity selection – 80 %), 10-fold CV;
• BDT – binary decision tree learning (maximal number of interior nodes – 15; minimal significant value of entropy reduction – 0.2; minimal number of objects in leaf nodes – 5), LOOCV;
• LDF – the linear Fisher discriminant (confidence threshold for correlation coefficient – 0), LOOCV;
• LM – the linear machine method (bias step – 0; right part components – 0.1; number of iterations – 10000; number of start iteration – 100; percentage of excluded objects – 1; step – 100), LOOCV;
• LoReg – the voting algorithm in which class estimates are calculated by voting over a system of logical regularities ("greedy" way; number of intervals – 5; maximal number of iterations – 100000; beginning of removal – 100; percentage of removed inequalities – 1 %; removal step – 100; minimal rate of objects – 0.1; number of random permutations – 3), 10-fold CV;
• MNN – the multiplicative neural network algorithm (number of iterations – 1000), LOOCV;
• MP – the multilayer perceptron (number of hidden layers – 3; number of neurons in a layer – 10; number of training iterations – 3000; activation function – sigmoid; training speed – 0.1; moment of inertia – 0; if the criterion function does not increase during the last 1000 iterations, the training speed is halved), 10-fold CV;
• ANN – artificial neural network learning using back-propagation (number of hidden layers – 3; number of neurons in a layer – 10; number of training iterations – 500; activation function – sigmoid; training speed – 0.1; threshold – 0.1; if the criterion function does not increase during the last 100 iterations, the training speed is halved), 10-fold CV;
• KNN – the k-nearest neighbors method (number of nearest neighbors – 1; prior class probabilities are taken into account), LOOCV;
• SVM – the support vector machine (penalty coefficient – 5; kernel function type – Gaussian; kernel function parameter – 6; maximal number of iterations – 500), 10-fold CV;
• SWS – the statistical weighted syndromes (rapid mode; number of partition borders – 1; optimized criteria threshold – 4.5; representativeness threshold – 0.5; instability threshold – 0.2; denial zone – 0.1), 10-fold CV.

«Scikit-learn» package for Python [10], 10-fold CV:
• LIR – linear_model.LinearRegression;
• R – linear_model.Ridge;
• L – linear_model.Lasso;
• EN – linear_model.ElasticNet;
• LL – linear_model.LassoLars;
• OMP – linear_model.OrthogonalMatchingPursuit;
• BR – linear_model.BayesianRidge;
• HR – linear_model.HuberRegressor;
• KR – KernelRidge;
• PLS – PLSRegression;
• SGDC – linear_model.SGDClassifier;
• P – linear_model.Perceptron;
• PACH – the passive aggressive classifier (loss='hinge');
• PACS – the passive aggressive classifier (loss='squared_hinge');
• LSVC – linear SVC;
• NSVC1 – NuSVC (nu=0.1);
• NSVC3 – NuSVC (nu=0.3);
• LR – linear_model.LogisticRegression;
• GPC – the Gaussian process classifier;
• GNB – Gaussian naive Bayes;
• DTC – tree.DecisionTreeClassifier;
• KNN – KNeighborsClassifier (n_neighbors=5);
• MP – neural_network.MLPClassifier;
• BC – ensemble.BaggingClassifier;
• RFC – ensemble.RandomForestClassifier;
• ETC – ensemble.ExtraTreesClassifier;
• ABC – ensemble.AdaBoostClassifier;
• GBC – ensemble.GradientBoostingClassifier.

Table 1. The accuracy estimation of various single machine learning methods (columns: Algorithm, CV accuracy %, AUC)

System «Recognition» – Task 1:
SVM 89.8 0.916
LM 90.7 0.884
ANN 89.1 0.880
SWS 82.3 0.872
LoReg 87.8 0.877
TLS 84.7 0.863
DTA 84.0 0.861
MNN 87.1 0.827
MP 84.5 0.816
KNN 87.6 0.805
ECA 83.6 0.799
LDF 86.0 0.754
SBT 85.6 0.745
BDT 81.4 0

System «Recognition» – Task 2:
DTA 61.5 0.864
SVM 71.8 0.842
SWS 58.2 0.780
LoReg 68.1 0.776
ANN 67.1 0.766
KNN 70.0 0.751
LM 65.3 0.734
MNN 66.7 0.694
TLS 64.8 0.675
LDF 71.4 0.671
MP 66.7 0.666
SBT 60.6 0.657
ECA 70.4 0.653
BDT 71.8 0.251

System «Recognition» – Task 3:
SVM 77.2 0.845
TLS 75.0 0.822
ECA 81.1 0.816
LM 77.8 0.804
DTA 78.9 0.801
LoReg 77.2 0.799
SWS 73.3 0.788
MNN 73.3 0.772
ANN 75.0 0.767
SBT 78.3 0.737
MP 72.8 0.733
KNN 71.7 0.700
LDF 71.7 0.675
BDT 78.9 0.607

System «Recognition» – Task 4:
DTA 59.4 0.865
LM 62.9 0.857
ANN 71.3 0.850
SWS 47.6 0.847
LoReg 64.3 0.843
SBT 56.6 0.836
SVM 67.1 0.832
BDT 59.4 0.803
ECA 60.1 0.780
LDF 50.3 0.756
KNN 62.9 0.742
MP 49.7 0.725
MNN 48.3 0.684

Scikit-learn in Python [10] – Task 1:
GBC 93.3 0.959
BC 92.2 0.951
ETC 92.0 0.948
RFC 92.2 0.945
MP 92.0 0.935
NSVC1 93.3 0.930
ABC 91.6 0.927
NSVC3 89.6 0.911
LIR 89.8 0.907
R 89.6 0.905
KR 77.1 0.905
LSVC 89.2 0.902
GPC 90.7 0.900
BR 88.3 0.895
LR 89.0 0.895
OMP 88.7 0.886
KNN 89.8 0.880
HR 82.5 0.850
PACH 83.3 0.834
PACS 83.3 0.834
SGDC 82.5 0.828
PLS 81.8 0.815
P 83.3 0.812
DTC 88.1 0.806
GNB 78.1 0.796
EN 81.4 0.5
LL 81.4 0.5
L 81.4 0.5

Scikit-learn in Python [10] – Task 3:
RFC 85.4 0.935
MP 89.0 0.935
GBC 85.4 0.931
NSVC3 85.4 0.925
HR 85.4 0.923
P 89.6 0.917
PACH 85.4 0.917
PACS 85.4 0.917
BC 86.6 0.916
LIR 86.6 0.916
NSVC1 84.1 0.916
R 87.8 0.913
KR 81.7 0.913
LR 87.8 0.913
ETC 85.4 0.912
BR 87.8 0.911
SGDC 86.0 0.910
PLS 86.0 0.905
LSVC 86.6 0.905
OMP 88.4 0.899
KNN 82.3 0.881
GPC 82.9 0.860
ABC 83.5 0.856
GNB 76.2 0.831
DTC 84.1 0.813
L 69.5 0.5
EN 69.5 0.5
LL 69.5 0.5

Table 2 includes the efficiency estimation results for the algorithm ensemble methods. The following algorithm notations are used («Recognition» package [9]):

• AC – the algebraic corrector (quadratic merit functional; minimal mean deviation = 0);
• CS – the convex stabilizer (function type – Gaussian);
• WD – the Woods dynamic method (number of objects in vicinity = 10);
• CCA – the complex committee method (averaging);
• CCM – the complex committee method (majority voting);
• BM – the Bayes method;
• CAS – the clustering and selection (number of clusters = 3);
• LC – the logic corrector;
• GPC – the generalized polynomial corrector (minimal mean deviation = 0);
• DC – the domains of competence (number of domains of competence = 3);
• DT – the decision templates.

«Scikit-learn» package for Python [10]:
• VCS – ensemble.VotingClassifier (voting='soft');
• VCH – ensemble.VotingClassifier (voting='hard').

Table 2. The accuracy estimation of various collective methods (columns: Algorithm, CV accuracy %, AUC)

System «Recognition» – Task 1:
CCA 91.8 0.920
LC 90.3 0.918
GPC 88.3 0.896
CCM 87.4 0.893
BM 86.8 0.885
DC 92.9 0.847
DT 92.0 0.796
AC 91.6 0.770
WD 82.3 0.719

System «Recognition» – Task 2:
CCA 81.2 0.906
GPC 80.8 0.904
LC 72.1 0.893
DC 79.5 0.874
CCM 75.5 0.864
BM 79.0 0.812
WD 62.0 0.742
DT 80.8 0.727
AC 55.0 0.711

System «Recognition» – Task 3:
CCA 87.2 0.906
GPC 87.2 0.904
LC 86.0 0.893
DC 87.2 0.874
CCM 87.2 0.864
BM 86.6 0.812
WD 81.1 0.742
DT 85.4 0.727
AC 82.3 0.711

System «Recognition» – Task 4:
LC 50.7 0.847
BM 55.2 0.840
WD 55.2 0.827
CCA 61.2 0.815
CCM 63.4 0.787
DT 59.0 0.745
DC 60.4 0.651
GPC 59.7 0.646
AC 52.2 0.646

Scikit-learn in Python [10] – Task 1:
VCS 94.4 0.889
VCH 93.7 0.867

Scikit-learn in Python [10] – Task 3:
VCS 87.2 0.852
VCH 86.6 0.836

The collective decision-making methods use the algorithms whose AUC values are marked in boldface type (see Table 1). The «default option» mode was used for choosing the algorithm parameter values.

It should be noted that in most cases the choice of the most accurate single ML methods by cross-validation and by ROC analysis coincides. The best algorithms according to AUC value (see Table 1) are those based on the support vector machine (SVM), the deadlock test (DTA), and artificial neural network learning (ANN), as well as the linear machine (LM), the statistical weighted syndromes (SWS), and the two-dimensional linear separators (TLS). Gradient Boosting (GBC) tops the list for the Scikit-learn package. The worst algorithms are binary decision tree learning (BDT), the search for the best test (SBT), the linear Fisher discriminant (LDF), the Elastic Net (EN), and the Lasso (L and LL).

The most efficient algorithm ensembles (see Table 2) are the complex committee method with averaging (CCA), the logic corrector (LC), the generalized polynomial corrector (GPC), and the voting (VC). In most cases the application of algorithm ensembles allows an increase in prediction accuracy.

5 Conclusions

The selection of the most accurate algorithms is among the most important tasks of ML. To solve this task, the peculiarities of the subject field must be taken into account. In this research, ML software from the «Recognition» and «Scikit-learn» packages was tested on inorganic compound prediction tasks. As a rule, the small sizes of the training samples in these tasks do not allow the selection of a representative subset of objects for examinational recognition. In that context, cross-validation on the training sample is the most acceptable procedure for estimating the accuracy of ML algorithms. A substantial difference in the numbers of objects of the different classes is a peculiarity of inorganic chemistry tasks. Therefore, ROC analysis is the most acceptable method for evaluating the accuracy of these algorithms.

Acknowledgments. This work was partially supported by the Russian Foundation for Basic Research (project nos. 17-07-01362 and 18-07-00080) and State assignments No. 007-00129-18-00 and 0063-2020-0003.

References

[1] N.N. Kiselyova. Komp'yuternoe konstruirovanie neorganicheskikh soedinenii. Ispol'zovanie baz dannykh i metodov iskusstvennogo intellekta (Computer Design of Inorganic Compounds: Use of Databases and Artificial Intelligence Methods). Moscow: Nauka, 2005.
[2] N.Y. Chen, W.C. Lu, J. Yang, G.Z. Li. Support Vector Machine in Chemistry. Singapore: World Scientific Publishing Co. Pte. Ltd., 2004.
[3] N.N. Kiselyova. Computer design of materials with artificial intelligence methods. In: Intermetallic Compounds. Principles and Practice, Vol. 3, Westbrook, J.H. & Fleischer, R.L., eds., p. 811-839. Chichester, UK: John Wiley & Sons, Ltd., 2002.
[4] T. Mueller, A.G. Kusne, R. Ramprasad. Machine Learning in Materials Science: Recent Progress and Emerging Applications. Reviews in Computational Chemistry, 29, p. 186-273, 2016.
[5] N.N. Kiselyova, A.V. Stolyarenko, V.A. Dudarev. Machine Learning Methods Application to Search for Regularities in Chemical Data. Selected Papers of the XIX International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2017), Moscow, Russia, October 9-13, 2017. CEUR Workshop Proceedings, v. 2022, p. 375-380, 2017. http://ceur-ws.org/Vol-2022/paper57.pdf
[6] N.N. Kiseleva. Prediction of the new compounds in the systems of halogenides of the univalent and bivalent metals. Russian Journal of Inorganic Chemistry, 59(5), p. 496-502, 2014.
[7] N.N. Kiselyova, A.V. Stolyarenko, V.V. Ryazanov, O.V. Sen'ko, A.A. Dokukin. Prediction of New Halo-Elpasolites. Russian Journal of Inorganic Chemistry, 61(5), p. 604-609, 2016.
[8] O.V. Senko. An Optimal Ensemble of Predictors in Convex Correcting Procedures. Pattern Recognition and Image Analysis, 19(3), p. 465-468, 2009.
[9] Yu.I. Zhuravlev, V.V. Ryazanov, O.V. Sen'ko. RECOGNITION: Mathematical Methods. Software System. Practical Solutions. Moscow: Phasis, 2006.
[10] F. Pedregosa et al. Scikit-learn: Machine Learning in Python. JMLR, 12, p. 2825-2830, 2011.
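Appendix-style note: the VCS and VCH notations above correspond to scikit-learn's VotingClassifier with soft and hard voting. A minimal sketch of that ensemble scheme, assuming synthetic data and an illustrative set of base estimators (not the ones used in the paper):

```python
# Sketch of the VCS / VCH ensembles: VotingClassifier with voting='soft'
# (averages predicted class probabilities) and voting='hard' (majority vote
# on predicted labels). Data and base estimators are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

base = [("gbc", GradientBoostingClassifier(random_state=0)),
        ("rfc", RandomForestClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000))]

vcs = VotingClassifier(estimators=base, voting="soft")
vch = VotingClassifier(estimators=base, voting="hard")

acc_soft = cross_val_score(vcs, X, y, cv=10, scoring="accuracy").mean()
acc_hard = cross_val_score(vch, X, y, cv=10, scoring="accuracy").mean()
print(f"soft voting: {acc_soft:.3f}, hard voting: {acc_hard:.3f}")
```

Soft voting requires every base estimator to expose predicted probabilities (predict_proba), which is why only such classifiers are combined here.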