<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Application of Decision Trees to Detect Process Disruptions in Aluminum Production</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
<institution>Institute of Computational Modelling of the Siberian Branch of the Russian Academy of Sciences</institution>
          ,
          <addr-line>50/44 Akademgorodok, Krasnoyarsk, 660036</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Siberian Federal University</institution>
          ,
          <addr-line>26, Kirenskogo str., Krasnoyarsk, 660074</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
<p>This paper considers the task of elaborating tools that enable the early detection of process disruptions in aluminum production using decision tree technology. The suggested method of forecasting process disruptions is based on daily average process indicators. The method includes a necessary stage of preliminary processing of the inputs and the subsequent construction of a mathematical model. The study defined the most informative attributes, solved the problem of unbalanced data, and compared approaches based on decision trees. The quality metrics revealed the most effective method for solving the set task.</p>
      </abstract>
      <kwd-group>
        <kwd>Decision Trees</kwd>
        <kwd>Random Forest</kwd>
        <kwd>Gradient Boosting</kwd>
        <kwd>Process Disruptions</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
Aluminum production is a strategically important sector of the economy. The competitiveness of this industry is primarily determined by reaching high technical and economic indexes, which, in their turn, depend on the quality of process control and timely estimation of the technical condition of both separate units and the entire aluminum production complex as a whole. Process disruptions that occur in the cycle of aluminum production are the main impediment to reaching the highest technical and economic indexes, for both the potline and the whole enterprise. Unfortunately, the causes of ineffective cell operation are investigated only after the event. However, the extensive introduction of hardware/software packages to control the production flow allows considerable volumes of monitoring data to be accumulated and used in decision-making. It is therefore becoming particularly relevant to develop tools for the early detection of process disruptions based on the monitoring data and for troubleshooting the causes of reductions in cell productivity.
The most common process disruptions in aluminum production include the occurrence of anode effects, the formation of coal froth, and distortion of the anode surface relief. The latter is considered to be the gravest disruption [
        <xref ref-type="bibr" rid="ref1">1</xref>
]: a “spike” is a buildup at the anode bottom of a regular cylindrical or conical shape; “lagging” is a bulge of the anode face of a rectangular shape, or an irregularity that covers up to 50–60% of the anode area; “overglow” is a buildup at the bottom of the anode of an irregular shape (sphere, mushroom, etc.) that is formed around any side of the anode unit. Such process disruptions manifest themselves only at advanced stages, when the buildups at the bottom develop a bulge that is embedded in the cathodic metal, which is invariably accompanied by changes in the process parameters. Machine learning methods applied to the monitoring data can reveal specific interdependencies across the data and, on this basis, allow process disruptions in cell operation to be identified.
      </p>
      <p>The identification of process disruptions can be viewed as a binary classification
task. The accuracy of its solution depends on the volume and quality of inputs
collected during the monitoring stage, selected methods of classification, diagnostic
quality criteria, and criticality of the controlled function indicators. The body of the
article is structured as follows. Section 2 spells out the classification objective.
Section 3 describes the inputs. Section 4 gives a description of the applied input data
preprocessing methods. Section 5 elaborates on the applied classification algorithms.
Section 6 presents the results of how the classification models operate.</p>
    </sec>
    <sec id="sec-2">
      <title>Research Objective</title>
<p>The classification task is set as follows. Let us assume that there is a set of objects $X = \{x^{(1)}, \dots, x^{(n)}\}$, each characterized by an $m$-dimensional vector of attributes $x^{(i)} = (x_1^{(i)}, \dots, x_m^{(i)})$, $i = 1, \dots, n$. Every object under study is attributed to a certain class $y \in Y = \{y_1, \dots, y_K\}$. In this case, classification is aimed at the following: a rule (algorithm) $a\colon X \to Y$ must be formulated, so that, based on a given value of the attribute vector, new unknown objects can be attributed to one of the classes.</p>
<p>As it relates to the problem of early detection of process disruptions based on the monitoring data, the task of classification boils down to dividing the states of a process facility into two classes: operative ($y = 0$) and faulty ($y = 1$, functioning with errors). The input data samples are used as a basis for an algorithm that must be able to use the recorded operation indicators of a given facility to diagnose its state with sufficiently high accuracy.</p>
<p>Binary classification tasks normally use the following indicators as their metrics:</p>
      <p>─ accuracy is the ratio of all correctly classified objects to the total number of classified objects:
$$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$
Here, TP stands for true-positive results (objects classified as “positive” that are actually positive, i.e. belong to class $y = 1$), TN stands for true-negative results (objects classified as “negative” that are actually negative, i.e. belong to class $y = 0$), FP stands for false-positive results (objects classified as “positive” but actually negative, i.e. belonging to class $y = 0$), and FN stands for false-negative results (objects classified as “negative” but actually positive, i.e. belonging to class $y = 1$).</p>
      <p>In case of imbalance between the classes, regular accuracy is replaced with balanced accuracy:
$$\mathrm{balanced\ accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right) \quad (2)$$</p>
      <p>─ precision is the ratio of objects classified as “positive” that are actually positive to the total number of objects classified as “positive”:
$$\mathrm{precision} = \frac{TP}{TP + FP} \quad (3)$$
Precision characterizes the ability of a given prediction model to correctly classify positive objects relative to the number of all objects classified as “positive”.</p>
      <p>─ recall is the ratio of objects classified as “positive” that are actually positive to the total number of actually positive objects:
$$\mathrm{recall} = \frac{TP}{TP + FN} \quad (4)$$
Recall characterizes the ability of a given prediction model to correctly classify positive objects from the set of all positive objects combined.</p>
      <p>─ F1 score is the harmonic mean of precision and recall:
$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \quad (5)$$
The F1 score is high only when both precision and recall are high, i.e. when the model both classifies cases correctly and recovers most of the truly positive items.</p>
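      <p>As an illustration, the following minimal sketch computes these metrics with scikit-learn; the label vectors are placeholders, not data from the study.</p>
      <preformat>
# A minimal sketch: computing the metrics (1)-(5) with scikit-learn.
# y_true and y_pred are illustrative placeholders, not data from the study.
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, precision_score, recall_score)

y_true = [0, 0, 1, 1, 0, 1]   # actual states (0 = operative, 1 = faulty)
y_pred = [0, 1, 1, 1, 0, 0]   # states predicted by a classifier

print("accuracy:         ", accuracy_score(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("precision:        ", precision_score(y_true, y_pred))
print("recall:           ", recall_score(y_true, y_pred))
print("F1 score:         ", f1_score(y_true, y_pred))
      </preformat>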
    </sec>
    <sec id="sec-3">
      <title>Data Description</title>
      <p>The software/hardware package aimed at forecasting process disruptions is based on
the daily average data collected through monitoring the operation of cell series in
potrooms No. 9 and 10 of the Khakas aluminum smelter for the period from 2017 to
2019.</p>
<p>The set of controlled cell operation indicators consists of 40 process parameters, including the following: duration of metal tapping (sec), metal level (cm), electrolyte level (cm), electrolyte temperature (°C), alumina dose (kg), bath chemistry parameters, parameters of the point feeding system for alumina and aluminum fluoride, parameters of the anode-to-cathode distance adjustment, amperage (kA), voltage parameters, back EMF (V), state and service life of cells (months), as well as registered process disruptions: the number of anode effects and the number of “spikes” and “lagging”.</p>
      <p>In this study, the prediction model is developed for one of the process disruptions,
namely the anode effect. The input array contains about 300,000 entries.
</p>
    </sec>
    <sec id="sec-4">
      <title>Data Analysis and Processing</title>
      <p>The inputs are processed in several stages that include filling missing values of object
attributes, identifying and deleting errors, selecting informative attributes, and
normalizing their values.</p>
      <p>
The first stage entails processing incomplete data. First, attributes whose number of gaps exceeded the set threshold (over 50% of entries) were deleted; then the missing values in the remaining data were reconstructed by the EM algorithm [
        <xref ref-type="bibr" rid="ref2">2</xref>
]. At most 5 consecutive missing values were interpolated in this way; the remaining entries with missing values were deleted. In addition, the data underwent a correlation analysis, with collinear attributes removed from the set (one of the two attributes was removed when the correlation coefficient exceeded 0.8).
      </p>
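      <p>The sketch below illustrates this stage with pandas under stated assumptions: the file name is hypothetical, and linear interpolation stands in for the EM algorithm [2] used in the study.</p>
      <preformat>
# An illustrative sketch of the first preprocessing stage; the file name
# "cell_monitoring.csv" is a hypothetical placeholder, and linear
# interpolation stands in for the EM algorithm [2] used in the study.
import numpy as np
import pandas as pd

df = pd.read_csv("cell_monitoring.csv")        # hypothetical input file

# Delete attributes with more than 50% of entries missing.
df = df.loc[:, df.isna().mean().le(0.5)]

# Reconstruct gaps (at most 5 consecutive values), then drop the rest.
df = df.interpolate(limit=5).dropna()

# Correlation analysis: remove one attribute of each collinear pair
# (correlation coefficient above 0.8).
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if upper[c].gt(0.8).any()]
df = df.drop(columns=to_drop)
      </preformat>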
<p>Next, the method of quartiles was used to identify outliers; the identified outliers were replaced with the upper or lower quartile values.</p>
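      <p>A minimal sketch of this quartile-based treatment follows; the 1.5 IQR fence width is an assumption, as the text does not state where an outlier begins.</p>
      <preformat>
import pandas as pd

def clip_outliers(s):
    """Replace values outside the quartile fences with the nearest quartile.
    The 1.5*IQR fence width is an assumption; the paper does not state it."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.mask(s.lt(q1 - 1.5 * iqr), q1).mask(s.gt(q3 + 1.5 * iqr), q3)

s = pd.Series([4.3, 4.4, 4.5, 4.4, 9.9])   # toy voltage readings (V)
print(clip_outliers(s))                    # the outlier 9.9 becomes 4.5
      </preformat>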
      <p>
        The inputs in the samples are unbalanced. The number of entries “with no
disruptions” is significantly higher than those “with disruptions”. One way to tackle
the issue is to use various sampling strategies [
        <xref ref-type="bibr" rid="ref3">3</xref>
]: undersampling, oversampling, and a hybrid method that uses both strategies simultaneously. The undersampling technique balances the data by removing samples from the majority class. The oversampling technique adds synthetic data samples to the minority class. The hybrid method combines both approaches: one to create additional samples in the minority class, the other to remove those samples that may lead to overfitting. The study presented in this paper used the hybrid method [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
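      <p>A possible realization of the hybrid strategy is sketched below with the imbalanced-learn library; the specific hybrid method of the study is not named here, so SMOTETomek is an illustrative choice.</p>
      <preformat>
# A minimal sketch of a hybrid sampling strategy [3] with imbalanced-learn:
# SMOTETomek oversamples the minority class (SMOTE) and then removes
# Tomek links that may lead to overfitting. The concrete hybrid method of
# the study is not named in the text, so SMOTETomek is an assumption.
from collections import Counter

from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification

# Synthetic imbalanced data standing in for the monitoring samples.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)
print("before:", Counter(y))
print("after: ", Counter(y_res))
      </preformat>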
      <p>
The next stage featured the selection of the most informative attributes. The presence of uninformative attributes in the samples causes overfitting. Attributes were selected by the method of recursive feature elimination (RFE) [
        <xref ref-type="bibr" rid="ref4">4</xref>
] combined with the random forest algorithm. The RFE method relies on the consecutive construction of models: a new model is built in every cycle, with the least informative features eliminated. As a result, a set of the 15 most significant parameters was obtained for detecting process disruptions.
      </p>
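      <p>The sketch below shows RFE driven by a random forest with scikit-learn; the counts mirror the text (40 parameters reduced to 15), while all other settings are assumptions.</p>
      <preformat>
# A sketch of recursive feature elimination [4] driven by a random forest.
# The counts mirror the text (40 parameters reduced to 15); all other
# settings are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=1000, n_features=40, n_informative=15,
                           random_state=0)   # stand-in for the 40 parameters

selector = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=0),
               n_features_to_select=15)      # keep the 15 best features
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
      </preformat>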
    </sec>
    <sec id="sec-5">
      <title>Description of Algorithms</title>
      <p>
        Some of the most common methods of machine learning used for building diagnostic
models are decision trees [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], ensembles of algorithms [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], artificial neural networks
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], fuzzy logic algorithms [
        <xref ref-type="bibr" rid="ref8">8</xref>
], etc. In this study, the diagnostic model for the detection of process disruptions was built on the basis of decision trees.
      </p>
      <p>
A decision tree is a model that presents decision-making rules in a hierarchical sequential structure [
        <xref ref-type="bibr" rid="ref9">9</xref>
]. Decision trees consist of internal nodes that carry the conditions to be tested (attributes) and leaves that contain outcomes (one of the two classes).
      </p>
<p>The decision tree model employs the principle of recursive decomposition of the object space into subsets. At every node, starting at the root, an attribute is selected as the basis for splitting the data into two subsets. The process runs until a stopping criterion is met.</p>
      <p>
The objects are classified using a decision tree by moving top-down from the root to a leaf in accordance with the conditions set at each node [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        A random forest is an algorithm used in machine learning that incorporates an
ensemble of decision trees [
        <xref ref-type="bibr" rid="ref10">10</xref>
]. An ensemble of decision trees is a set of decision trees in which each tree is built on a sample drawn from the original one using the bootstrap technique [
        <xref ref-type="bibr" rid="ref11">11</xref>
]. The classification result of the random forest algorithm is determined through voting: the class forecasted by the largest number of trees is selected as the answer.
      </p>
      <p>
        Boosting in decision trees is an algorithm applied in machine learning that uses an
ensemble of decision trees [
        <xref ref-type="bibr" rid="ref10">10</xref>
]. Unlike the random forest algorithm, boosting combines an ensemble of weak classifiers into a stronger classifier: each successive decision tree is trained on the errors of the preceding trees, thereby correcting them and boosting the quality of the entire ensemble. As of now, the boosting algorithm with decision trees has a number of modifications, one of the most effective of which is gradient boosting [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
<p>The mathematical models and algorithms for monitoring the state of process facilities were developed using Python tools. Python has a large number of machine learning libraries that implement various classification algorithms, including approaches based on decision trees.</p>
      <p>
        The current study uses the following methods of classification on the basis of
decision trees:
─ decision tree (scikit-learn library) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ];
─ gradient boosting XGBClassifier (XGboost library) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ];
─ gradient boosting CatBoostClassifier (Catboost library) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ];
─ boosting on unbalanced data, RUSBoostClassifier (imbalanced-learn library) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ];
─ random forest on unbalanced data, BalancedRandomForestClassifier (imbalanced-learn library) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
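      <p>For illustration, the sketch below instantiates the five listed classifiers with default settings; the hyperparameters actually tuned in the study (Tables 1-5) are omitted.</p>
      <preformat>
# A sketch instantiating the five classifiers listed above with default
# settings; the hyperparameters actually tuned in the study (Tables 1-5)
# are omitted here.
from catboost import CatBoostClassifier
from imblearn.ensemble import BalancedRandomForestClassifier, RUSBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

models = {
    "decision tree": DecisionTreeClassifier(),
    "XGBoost": XGBClassifier(),
    "CatBoost": CatBoostClassifier(verbose=0),
    "RUSBoost": RUSBoostClassifier(),
    "balanced random forest": BalancedRandomForestClassifier(),
}
# Each model can then be fitted with model.fit(X_train, y_train).
      </preformat>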
      <p>
Hyperparameters in the models were set up using the random search method with cross-validation. Parameters for the models were selected based on the principle of maximum accuracy. Tables 1-5 show the optimal values of the main hyperparameters for each model.
      </p>
      <p>
        The method of gradient boosting enables the setup of weighting factors when dealing with unbalanced data. The scale_pos_weight hyperparameter makes it possible to control the balance between classes by setting the weight of each class. Class weights were selected on the basis of the metaheuristic algorithm of global optimization of orb-weaving spiders (Araneidae algorithm, AA) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
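      <p>A minimal sketch of such a search follows; the search space below is an assumption, and random search over scale_pos_weight merely stands in for the AA-based selection of class weights [16].</p>
      <preformat>
# A sketch of random search with cross-validation for one of the models;
# the search space below is an assumption, not the one used in the study,
# and random search over scale_pos_weight stands in for the Araneidae
# algorithm [16] used to select class weights.
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

param_dist = {
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
    "n_estimators": randint(100, 500),
    "scale_pos_weight": uniform(1.0, 20.0),   # class-balance weighting factor
}
search = RandomizedSearchCV(XGBClassifier(), param_dist, n_iter=20,
                            scoring="balanced_accuracy", cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_)
      </preformat>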
    </sec>
    <sec id="sec-6">
      <title>Analysis and Comparison of Results</title>
<p>To train the model, the samples were split into training and testing sets in the ratio 2:1. The values of the abovementioned metrics calculated for each of the algorithms are quoted in Table 6. The best results in the considered metrics were demonstrated by the CatBoost algorithm.</p>
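      <p>The evaluation protocol can be sketched as follows; the synthetic data stands in for the smelter monitoring samples.</p>
      <preformat>
# A sketch of the evaluation protocol: a 2:1 train/test split and the
# metrics of Section 2, shown here with CatBoost on synthetic stand-in data.
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)   # 2:1 split

model = CatBoostClassifier(verbose=0).fit(X_train, y_train)
y_pred = model.predict(X_test)
print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
      </preformat>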
<p>The paper presents the results of a study aimed at developing tools for the early detection of process disruptions in aluminum production using decision tree technology. The suggested model predicts process disruptions in aluminum production based on information about the daily average process indicators. The method includes a compulsory stage of input preprocessing and the subsequent construction of a mathematical model. The study revealed the most informative attributes, solved the problem of unbalanced data, and compared a number of approaches. The mathematical models and diagnostic algorithms were implemented for one of the process disruptions, namely the anode effect. The best results among the analyzed approaches were shown by the CatBoost algorithm. In the future, for more accurate forecasting, process disruptions are planned to be predicted using an ensemble method with multiple learning algorithms.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Puzanov</surname>
            ,
            <given-names>I.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zavadyak</surname>
            ,
            <given-names>A.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klykov</surname>
            ,
            <given-names>V.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Makeev</surname>
            ,
            <given-names>A.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plotnikov</surname>
            ,
            <given-names>V.N.</given-names>
          </string-name>
          :
          <article-title>Continuous monitoring of information on anode current distribution as means of improving the process of controlling and forecasting process disturbances</article-title>
          .
          <source>J. Sib. Fed. Univ. Eng. technol. 9</source>
          (
          <issue>6</issue>
          ).
          <fpage>788</fpage>
          -
          <lpage>801</lpage>
          (
          <year>2016</year>
          ).
          <source>doi: 10.17516/1999-494X-2016-9-6-788-801</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Pigott</surname>
          </string-name>
          , T.D.:
          <article-title>A review of methods for missing data</article-title>
          .
          <source>Educational research and evaluation. 7</source>
          (
          <issue>4</issue>
          ).
          <fpage>353</fpage>
          -
          <lpage>383</lpage>
          (
<year>2001</year>
          ).
          <source>doi: 10.1076/edre.7.4.353.8937</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , Ma, Y. (Eds.):
          <article-title>Imbalanced Learning</article-title>
          . Foundations, Algorithms, and Applications. Wiley-IEEE Press (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Batista</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bazzan</surname>
            ,
            <given-names>A.L.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monard</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          :
          <article-title>Balancing Training Data for Automated Annotation of Keywords: a Case Study</article-title>
          .
          <source>In: WOB</source>
          . pp.
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Rokach</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maimon</surname>
            <given-names>O.</given-names>
          </string-name>
          :
<article-title>Data Mining with Decision Trees</article-title>
          .
          <source>Theory and Applications</source>
          . London: World Scientific Publishing Co (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>Y.:</given-names>
          </string-name>
          <article-title>Ensemble machine learning: methods and applications</article-title>
          . USA: Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Deep learning</article-title>
          . Cambridge: MIT press (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Zadeh</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aliev</surname>
          </string-name>
          , R.:
          <article-title>Fuzzy logic Theory and applications</article-title>
          . London: World Scientific Publishing Co (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Decision trees and random forests</article-title>
          .
          <source>Blue Windmill Media</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Natekin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knoll</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
<article-title>Gradient boosting machines, a tutorial</article-title>
          .
          <source>Frontiers in neurorobotics 7</source>
          (
          <issue>21</issue>
          ). (
          <year>2013</year>
). doi: 10.3389/fnbot.2013.00021
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Buhlmann</surname>
          </string-name>
, P.:
          <article-title>Bagging, boosting and ensemble methods</article-title>
          .
          <source>Handbook of computational statistics: Concepts and methods</source>
          . Berlin: Springer. pp.
          <fpage>877</fpage>
          -
          <lpage>907</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
<mixed-citation>12. Decision Tree Classifier, https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. XGBoost Documentation, https://xgboost.readthedocs.io/en/latest/index.html</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. CatBoost, https://catboost.ai/</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <article-title>Imbalanced-learn documentation</article-title>
, https://imbalanced-learn.readthedocs.io/en/stable/index.html
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Baranov</surname>
            ,
            <given-names>V.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lugovaya</surname>
            ,
            <given-names>N.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikhalev</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kudymov</surname>
            ,
            <given-names>V.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strekaleva</surname>
            ,
            <given-names>T.V.</given-names>
          </string-name>
          :
          <article-title>The algorithm of overall optimization based on the principles of intraspecific competition of orb-web spiders</article-title>
          .
          <source>In: Proceedings of the II Conference on Advanced Technologies in Aerospace, Mechanical and Automation Engineering - MIST: Aerospace-2019</source>
          , vol.
          <volume>734</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>