<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Various Machine Learning Methods Efficiency Comparison in Application to Inorganic Compounds Design</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>O.V. Sen'ko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>N.N. Kiselyova</string-name>
          <email>kis@imet.ac.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>V.A. Dudarev</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A.A. Dokukin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>V.V. Ryazanov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institution of Russian Academy of Sciences A.A. Baikov Institute of Metallurgy and Materials Science RAS</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Proceedings of the XX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL'2018)</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>152</fpage>
      <lpage>156</lpage>
      <abstract>
        <p>The accuracy of various machine learning methods (the «Recognition» package and the «Scikit-learn» package for Python) was compared on inorganic chemistry tasks. Cross-validation and ROC analysis were applied to accuracy estimation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Machine learning (ML) methods are widely used to predict the formation of inorganic compounds and to estimate their properties [<xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6 ref7">1-7</xref>]. The paper [<xref ref-type="bibr" rid="ref5">5</xref>] contains a statistical analysis of the popularity of various ML methods applied in inorganic materials science. However, despite the success of these methods on numerous tasks in this subject field, no attempt had been made to compare the accuracy of a wide variety of methods using ROC analysis.
      </p>
      <p>
        To solve this task, the particularities of the subject field must be taken into account. In particular, the attribute description has a composite structure: the set of chemical element parameters (the components of an inorganic substance) is repeated as many times as there are elements in the compound. Owing to the periodic dependence of chemical element properties on atomic number, strong correlations are observed within the parameter set of each component, and the relative informativeness of an individual element property is low. For this reason, the properties of simpler compounds (e.g., simple oxides, halogenides, chalcogenides, etc.) as well as algebraic functions of the components’ properties are used. Although these parameters are well studied, there are gaps in the property values (incomplete data). The gaps are filled in a variety of ways; for example, the periodic dependences of element parameters on atomic number are exploited with appropriate interpolation and extrapolation. A further peculiarity of inorganic chemistry tasks is the large asymmetry of training sample sizes for different classes. Very often the least represented classes (as a rule, newly discovered classes of substances) are the most interesting to chemists. Experimental errors and discrepancies in the classification of inorganic compounds in the training samples are yet another problem in compound design that drastically decreases prediction accuracy. Doubtless, the accuracy depends on the informativeness of the attribute description and on the representativeness of the training sample. Therefore, to evaluate various ML methods we chose a number of tasks with highly reliable predictions (more than 85 % according to later experimental verification) [<xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>].
      </p>
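      <p>
        The interpolation-based gap filling mentioned above can be sketched in Python. This is a minimal sketch: the element property values below are hypothetical, and linear interpolation over atomic number is only one of the gap-filling options named in the text.
      </p>
      <preformat>
```python
# Fill a missing element-property value by interpolating over atomic
# number. The ionic-radius values here are hypothetical, chosen only
# to illustrate the procedure.
import numpy as np

atomic_numbers = np.array([3, 11, 19, 37, 55])               # Li, Na, K, Rb, Cs
ionic_radius = np.array([0.76, 1.02, 1.38, np.nan, 1.67])    # gap for Rb

known = ~np.isnan(ionic_radius)
filled = ionic_radius.copy()
# np.interp(x, xp, fp): piecewise-linear interpolation at the missing points
filled[~known] = np.interp(atomic_numbers[~known],
                           atomic_numbers[known],
                           ionic_radius[known])
print(filled)
```
      </preformat>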
    </sec>
    <sec id="sec-2">
      <title>2 Prediction accuracy estimation methods</title>
      <p>Cross-validation (CV) on the training sample is the most widely used universal and reliable tool for estimating machine learning quality; the number of recognition errors can then be taken into account. However, one of the problems in ML accuracy estimation is determining recognition efficiency in the case of asymmetrical classes, where the numbers of objects in different classes differ significantly. This situation is common when only very few new materials with practically important properties have been obtained, and a search for not-yet-synthesized analogues of these substances can reduce the time and cost of experimental research. In most ML applications the standard decision rule minimizes the total number of erroneous predictions, which results in good recognition of compounds from the large class and bad recognition of substances from the small class. As a result, the overall recognition accuracy gives a poor notion of the efficiency of a given method or attribute description. An alternative approach is Receiver Operating Characteristic (ROC) analysis, which allows recognition accuracy to be compared for the targeted and alternative classes while varying the cut-off that identifies class membership.</p>
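      <p>
        The distinction between overall accuracy and ROC AUC on asymmetric classes can be illustrated with a short sketch. The 90/10 synthetic sample and the classifier below are illustrative assumptions, not the paper’s chemical data.
      </p>
      <preformat>
```python
# On an imbalanced sample, a trivial majority-class rule scores high
# accuracy but AUC = 0.5; AUC reflects actual ranking ability.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# Always predicting the majority class reaches ~90 % accuracy,
# yet gives AUC = 0.5 (no ability to rank the minority class).
triv_acc = accuracy_score(y_te, np.zeros_like(y_te))
triv_auc = roc_auc_score(y_te, np.zeros(len(y_te)))
print(acc, auc, triv_acc, triv_auc)
```
      </preformat>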
      <p>
        The following prediction accuracy estimation procedure was used in this analysis. The available training sample is divided into two non-intersecting stratified subsamples, which are then used to train and to assess single and collective methods independently. Then the ROC analysis is carried out and the Area Under Curve (AUC) measure is calculated. As a rule, in collective decision making only the methods whose AUC exceeds some fixed threshold are used.
      </p>
    </sec>
    <sec id="sec-tasks">
      <title>3 The test tasks</title>
      <p>
        3.1 Prediction of the formation of compounds with the composition A2BCHal6 (A and C are various monovalent metals; B are trivalent metals; Hal is F, Cl, or Br) [<xref ref-type="bibr" rid="ref7">7</xref>].
2 classes:
1. formation of the compound – 744 examples;
2. non-formation of the compound – 170 examples.
137 attributes, including the 3 most informative algebraic functions of the initial attributes.
      </p>
      <p>
        3.2 Prediction of the formation and the crystal structure type of compounds with the composition A2BCHal6 [<xref ref-type="bibr" rid="ref7">7</xref>].
4 classes:
1. elpasolites – 283 examples;
2. compounds with the Cs2NaCrF6 crystal structure type – 19 examples;
3. other crystal structure types – 57 examples;
4. non-formation of the compound – 83 examples.
134 attributes.
      </p>
      <p>
        3.3 Prediction of the formation of compounds with the composition ABHal3 (A are various monovalent metals; B are bivalent metals; Hal is F, Cl, Br, or I) [<xref ref-type="bibr" rid="ref6">6</xref>].
2 classes:
1. formation of the compound – 237 examples;
2. non-formation of the compound – 107 examples.
88 attributes.
      </p>
      <p>
        3.4 Prediction of the formation and the crystal structure type of compounds with the composition ABHal3 [<xref ref-type="bibr" rid="ref6">6</xref>].
6 classes:
1. perovskites – 46 examples;
2. compounds with the GdFeO3 crystal structure type – 20 examples;
3. compounds with the CsNiCl3 crystal structure type – 38 examples;
4. compounds with the NH4CdCl3 crystal structure type – 23 examples;
5. other crystal structure types – 39 examples;
6. non-formation of the compound – 111 examples.
88 attributes.
      </p>
      <p>
        The most important attribute sets were selected using a program based on the method of [<xref ref-type="bibr" rid="ref8">8</xref>].
      </p>
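      <p>
        The estimation procedure described in Section 2 (two non-intersecting stratified subsamples, then ROC analysis) can be sketched as follows. Synthetic two-class data and an SVM classifier stand in here for the actual chemical data sets; both are illustrative assumptions.
      </p>
      <preformat>
```python
# Divide the sample into two stratified halves, train on one half,
# compute ROC AUC on the other.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=900, weights=[0.8, 0.2], random_state=1)

# A stratified 50/50 split keeps the class proportions in both halves.
X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5, stratify=y,
                                      random_state=1)

model = SVC(kernel="rbf", probability=True, random_state=1).fit(X_a, y_a)
auc = roc_auc_score(y_b, model.predict_proba(X_b)[:, 1])
print(round(auc, 3))
```
      </preformat>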
    </sec>
    <sec id="sec-3">
      <title>4 The analysis of obtained results</title>
      <p>
        Table 1 contains the efficiency estimation results for single machine learning methods. The following algorithm notations were used (“Recognition” package [<xref ref-type="bibr" rid="ref9">9</xref>]):
• ECA – the estimates calculation algorithm (fixed size of support sets = 1), leave-one-out CV (LOOCV);
• SBT – the search for the best test (maximal number of ε-thresholds for one attribute = 5; maximal size of sample = 20; number of samples of the same size = 3; percent of tests used in recognition – 10 %; unitary weights), LOOCV;
• TLS – the two-dimensional linear separators method (bias step – 0; right part components – 0.1; number of iterations – 10000; number of start iteration – 100; percentage of removed objects – 1; step – 100; threshold of regularity selection – 80 %), 10-fold CV;
• BDT – the binary decision tree learning (maximal number of interior nodes – 15; minimal significant value of entropy reduction – 0.2; minimal number of objects in leaf nodes – 5), LOOCV;
• LDF – the linear Fisher discriminant (confidence threshold for correlation coefficient – 0), LOOCV;
• LM – the linear machine method (bias step – 0; right part components – 0.1; number of iterations – 10000; number of start iteration – 100; percentage of excluded objects – 1; step – 100), LOOCV;
• LoReg – the voting algorithm in which class estimates are calculated by voting over a system of logical regularities (“greedy” way; number of intervals – 5; maximal number of iterations – 100000; beginning of removal – 100; percentage of removed inequalities – 1 %; removal step – 100; minimal rate of objects – 0.1; number of random permutations – 3), 10-fold CV;
• MNN – the multiplicative neural network algorithm (number of iterations – 1000), LOOCV;
• MP – the multilayer perceptron (network configuration: 3 hidden layers with 10 neurons per layer; number of training iterations – 3000; activation function – sigmoid; training speed – 0.1; moment of inertia – 0; if the criterion function does not increase during the last 1000 iterations, the speed is halved), 10-fold CV;
• ANN – the artificial neural network trained with back-propagation (network configuration: 3 hidden layers with 10 neurons per layer; number of training iterations – 500; activation function – sigmoid; training speed – 0.1; threshold – 0.1; if the criterion function does not increase during the last 100 iterations, the speed is halved), 10-fold CV;
• KNN – the k-nearest neighbors method (number of nearest neighbors – 1; prior class probabilities are taken into account), LOOCV;
• SVM – the support vector machine (penalty coefficient – 5; kernel function type – Gaussian; kernel function parameter – 6; maximal number of iterations – 500), 10-fold CV;
• SWS – the statistical weighted syndromes (rapid mode; number of partition borders – 1; optimized criteria threshold – 4.5; representativeness threshold – 0.5; instability threshold – 0.2; denial zone – 0.1), 10-fold CV;
• DTA – the deadlock test algorithm (test searching algorithm – effective; divisor of ε-thresholds = 2; maximal size of sample = 20; number of subsamples of the same size = 3), LOOCV.
      </p>
      <p>
        “Scikit-learn” package for Python [<xref ref-type="bibr" rid="ref10">10</xref>], 10-fold CV:
• LIR – linear_model.LinearRegression;
• R – linear_model.Ridge;
• L – linear_model.Lasso;
• EN – linear_model.ElasticNet;
• LL – linear_model.LassoLars;
• OMP – linear_model.OrthogonalMatchingPursuit;
• BR – linear_model.BayesianRidge;
• HR – linear_model.HuberRegressor;
• KR – KernelRidge;
• PLS – PLSRegression;
• SGDC – linear_model.SGDClassifier;
• P – linear_model.Perceptron;
• PACH – the passive aggressive classifier (loss='hinge');
• PACS – the passive aggressive classifier (loss='squared_hinge');
• LSVC – linear SVC;
• NSVC1 – NuSVC (nu=0.1);
• NSVC3 – NuSVC (nu=0.3);
• LR – linear_model.LogisticRegression;
• GPC – the Gaussian process classifier;
• GNB – the Gaussian naive Bayes;
• DTC – tree.DecisionTreeClassifier;
• KNN – KNeighborsClassifier (n_neighbors=5);
• MP – neural_network.MLPClassifier;
• BC – ensemble.BaggingClassifier;
• RFC – ensemble.RandomForestClassifier;
• ETC – ensemble.ExtraTreesClassifier;
• ABC – ensemble.AdaBoostClassifier;
• GBC – ensemble.GradientBoostingClassifier.
      </p>
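      <p>
        The evaluation protocol for the Scikit-learn estimators (10-fold cross-validated ROC AUC) can be sketched for a few of the listed methods. Synthetic data replace the chemical data sets, and the subset of estimators is chosen only for brevity.
      </p>
      <preformat>
```python
# 10-fold cross-validated ROC AUC for several of the listed estimators,
# ranked from best to worst.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)
estimators = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "RFC": RandomForestClassifier(random_state=0),
    "GBC": GradientBoostingClassifier(random_state=0),
}
scores = {name: cross_val_score(est, X, y, cv=10, scoring="roc_auc").mean()
          for name, est in estimators.items()}
for name, auc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(auc, 3))
```
      </preformat>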
      <sec id="sec-3-1">
        <title>Table 1. AUC values of the single algorithms (“Recognition” package)</title>
        <p>[Table 1: AUC values of the single algorithms (ANN, KNN, LM, MNN, TLS, LDF, MP, SBT, ECA, BDT) on the test tasks; the column layout of the table could not be recovered from the extracted text.]</p>
      </sec>
      <sec id="sec-3-2">
        <title>Table 1 (continued). AUC values of the single algorithms (“Scikit-learn” package)</title>
        <p>[AUC values of HR, PACH, PACS, SGDC, PLS, P, DTC, GNB, EN, LL, L and the remaining Scikit-learn estimators; the column layout of the table could not be recovered from the extracted text.]</p>
        <p>
          Table 2 includes the efficiency estimation results for the ensemble methods. The following algorithm notations were used (“Recognition” package [<xref ref-type="bibr" rid="ref9">9</xref>]):
• AC – the algebraic corrector (quadratic merit functional; minimal mean deviation = 0);
• CS – the convex stabilizer (function type – Gaussian);
• WD – the Woods dynamic method (number of objects in vicinity = 10);
• CCA – the complex committee method with averaging;
• CCM – the complex committee method with majority voting;
• BM – the Bayes method;
• CAS – the clustering and selection method (number of clusters = 3);
• LC – the logic corrector;
• GPC – the generalized polynomial corrector;
• VC – the voting method.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Table 2. CV accuracy and AUC of the ensemble methods</title>
        <p>[Table 2: CV accuracy (%) and AUC of the ensemble methods; only one row survived extraction: VCH – 86.6 % – 0.836.]</p>
        <p>The collective decision-making methods use the algorithms whose AUC values are marked in boldface in Table 1. The «default option» mode was used for choosing the algorithm parameter values.</p>
        <p>It should be noted that in most cases the choices of the most accurate single ML methods by cross-validation and by ROC analysis coincide. The best algorithms according to AUC values (see Table 1) are the support vector machine (SVM), the deadlock test algorithm (DTA), the artificial neural network trained with back-propagation (ANN), as well as the linear machine (LM), the statistical weighted syndromes (SWS), and the two-dimensional linear separators (TLS). Gradient boosting (GBC) tops the list for the Scikit-learn package. The worst algorithms are the binary decision tree learning (BDT), the search for the best test (SBT), the linear Fisher discriminant (LDF), the Elastic Net (EN), and the Lasso (L and LL).</p>
        <p>The most efficient algorithm ensembles (see Table 2) are the complex committee method with averaging (CCA), the logic corrector (LC), the generalized polynomial corrector (GPC), and the voting (VC). In most cases applying algorithm ensembles increases the prediction accuracy.</p>
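        <p>
          The ensemble construction discussed above, keeping only the single classifiers whose AUC exceeds a fixed threshold and combining them by voting, can be sketched as follows. The threshold value, the candidate set, and the use of soft voting are illustrative assumptions, not the exact procedure of the «Recognition» package.
        </p>
        <preformat>
```python
# Select single classifiers with cross-validated AUC above a threshold,
# then combine the selected ones into a soft-voting ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=2)
candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "GNB": GaussianNB(),
    "RFC": RandomForestClassifier(random_state=2),
}
THRESHOLD = 0.7  # illustrative; the paper does not state its threshold
selected = [(name, est) for name, est in candidates.items()
            if cross_val_score(est, X, y, cv=10, scoring="roc_auc").mean()
            > THRESHOLD]

ensemble = VotingClassifier(estimators=selected, voting="soft")
ens_auc = cross_val_score(ensemble, X, y, cv=10, scoring="roc_auc").mean()
print([name for name, _ in selected], round(ens_auc, 3))
```
        </preformat>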
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5 Conclusions</title>
      <p>The selection of the most accurate algorithms is among the most important tasks of ML. To solve it, the peculiarities of the subject field must be taken into account. In this research the ML software from the «Recognition» and «Scikit-learn» packages was tested on inorganic compound prediction tasks. As a rule, the small training samples in these tasks do not allow selecting a representative subset of objects for examinational recognition; in this context, cross-validation on the training sample is the most acceptable procedure for estimating the accuracy of ML algorithms. A substantial difference between the numbers of objects in different classes is a peculiarity of inorganic chemistry tasks; therefore, ROC analysis is the most acceptable method for evaluating the accuracy of these algorithms.</p>
      <p>Acknowledgments. This work was partially
supported by the Russian Foundation for Basic Research
(project nos. 17-07-01362 and 18-07-00080) and State
assignments No. 007-00129-18-00 and 0063-2020-0003.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.N.</given-names>
            <surname>Kiselyova</surname>
          </string-name>
          .
          <article-title>Komp'yuternoe konstruirovanie neorganicheskikh soedinenii. Ispol'zovanie baz dannykh i metodov iskusstvennogo intellekta (Computer Design of Inorganic Compounds:</article-title>
          <source>Use of Databases and Artificial Intelligence Methods)</source>
          . Moscow: Nauka.
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.C.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <source>Support vector machine in chemistry</source>
          . Singapore: World Scientific Publishing Co. Pte. Ltd.
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.N.</given-names>
            <surname>Kiselyova</surname>
          </string-name>
          .
          <article-title>Computer design of materials with artificial intelligence methods</article-title>
          .
          <source>In Intermetallic Compounds. Principles and Practice</source>
          , Vol.
          <volume>3</volume>
          ,
          <string-name>
            <surname>Westbrook</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Fleischer</surname>
          </string-name>
          , R.L. eds., p.
          <fpage>811</fpage>
          -
          <lpage>839</lpage>
          , Chichester, UK: John Wiley&amp;Sons, Ltd.
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.G.</given-names>
            <surname>Kusne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramprasad</surname>
          </string-name>
          .
          <source>Machine Learning in Materials Science. Recent Progress and Emerging Applications. Reviews in Computational Chemistry</source>
          ,
          <volume>29</volume>
          , p.
          <fpage>186</fpage>
          -
          <lpage>273</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.N.</given-names>
            <surname>Kiselyova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.V.</given-names>
            <surname>Stolyarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.A.</given-names>
            <surname>Dudarev</surname>
          </string-name>
          .
          <article-title>Machine Learning Methods Application to Search for Regularities in Chemical Data</article-title>
          . Selected Papers of the XIX International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2017), Moscow, Russia, October 9-13, 2017. CEUR Workshop Proceedings, v. 2022, p.
          <fpage>375</fpage>
          -
          <lpage>380</lpage>
          ,
          <year>2017</year>
          . http://ceur-ws.org/Vol-2022/paper57.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.N.</given-names>
            <surname>Kiseleva</surname>
          </string-name>
          .
          <article-title>Prediction of the new compounds in the systems of halogenides of the univalent and bivalent metals</article-title>
          .
          <source>Russian Journal of Inorganic Chemistry</source>
          ,
          <volume>59</volume>
          (
          <issue>5</issue>
          ), p.
          <fpage>496</fpage>
          -
          <lpage>502</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.N.</given-names>
            <surname>Kiselyova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.V.</given-names>
            <surname>Stolyarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.V.</given-names>
            <surname>Ryazanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.V.</given-names>
            <surname>Sen'ko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.A.</given-names>
            <surname>Dokukin</surname>
          </string-name>
          .
          <article-title>Prediction of New Halo-Elpasolites</article-title>
          .
          <source>Russian Journal of Inorganic Chemistry</source>
          .
          <volume>61</volume>
          (
          <issue>5</issue>
          ), p.
          <fpage>604</fpage>
          -
          <lpage>609</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.V.</given-names>
            <surname>Senko</surname>
          </string-name>
          .
          <article-title>An Optimal Ensemble of Predictors in Convex Correcting Procedures</article-title>
          .
          <source>Pattern Recognition and Image Analysis</source>
          .
          <volume>19</volume>
          (
          <issue>3</issue>
          ), p.
          <fpage>465</fpage>
          -
          <lpage>468</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Yu. I.</given-names>
            <surname>Zhuravlev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Ryazanov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O. V.</given-names>
            <surname>Sen'ko</surname>
          </string-name>
          .
          <source>RECOGNITION. Mathematical methods. Software system. Practical solutions</source>
          . Moscow: Phasis.
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Pedregosa</surname>
          </string-name>
          et al.
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>JMLR 12</source>
          , pp.
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>