=Paper=
{{Paper
|id=Vol-3047/paper2
|storemode=property
|title=Application of Association Rule Mining to Detect “Spike” Disruptions in Aluminum Production
|pdfUrl=https://ceur-ws.org/Vol-3047/paper2.pdf
|volume=Vol-3047
|authors=Anna V. Korobko,Anna M. Metus,Dmitry Ogurtsov,Tatiana Penkova,Iliya Puzanov,Andrey Zavadyak
}}
==Application of Association Rule Mining to Detect “Spike” Disruptions in Aluminum Production==
Application of Association Rule Mining to Detect “Anode Spike” Disruptions in Aluminum Production

Anna Korobko<sup>1</sup>, Anna Metus<sup>2</sup>, Dmitry Ogurtsov<sup>3</sup>, Tatiana Penkova<sup>2</sup>, Iliya Puzanov<sup>4</sup> and Andrey Zavadyak<sup>4</sup>

<sup>1</sup> Reshetnev Siberian State University of Science and Technology, 31 Krasnoyarsky Rabochy ave., Krasnoyarsk, 660037, Russia

<sup>2</sup> Institute of Computational Modelling of the Siberian Branch of the Russian Academy of Sciences, 50/44 Akademgorodok, Krasnoyarsk, 660036, Russia

<sup>3</sup> Moscow Institute of Physics and Technology (National Research University), 1 “А” Kerchenskaya st., Moscow, 117303, Russia

<sup>4</sup> RUSAL Engineering and Technology Center, 37/1 Pogranichnikov st., Krasnoyarsk, 660111, Russia

===Abstract===
The article focuses on applying association rule mining to predict “Anode spike”-type process disruptions using daily average monitoring data from a series of reduction cells in the experimental area of the Sayanogorsk Aluminum Smelter. The data were binarized according to different criteria for grouping the values of the process parameters into ranges: statistical norms, quartiles, and ranges attributed to the occurrence of disruptions. Prediction models were built as sets of association rules. Quality metrics were used to determine the optimal model settings, and the validation results are presented as well. The model selected for implementation uses binarization based on quartiles. The resulting validation values suggest that the model is effective enough for practical use.

Keywords: detection of disruptions, data mining, association rule mining, data binarization, aluminum production

===1. Introduction===
High technical and economic performance in the aluminum industry is largely defined by the operating quality of the process. One of the gravest process disruptions, which leads to a significant drop in metal yield, is deformation of the anode surface, which may be of various types [1].
* A “spike” is a buildup of a regular cylindrical or conical shape at the anode bottom.
* “Lagging” is a bulge of a rectangular shape on the anode face, or an irregularity which covers up to 50-60% of the anode area.
* “Overglow” is a buildup of an irregular shape (sphere, mushroom, etc.) at the bottom of the anode which is formed around any side of the anode unit.

Currently, such defects are only discovered at a very advanced stage, and their causes are still unclear. There are several hypotheses which are yet to be experimentally validated by means of data mining techniques applied to the monitoring data for reduction cells [1].

A common approach to analyzing monitoring data coming from different production processes is association rule mining, a data mining technique which looks for patterns in big data [2-7]. Association rules aid in identifying operation modes leading to higher rates of flawed items [3, 4], finding correlations between various types of defects [5], and predicting the output of the finished product [6]. Applying association rule mining in aluminum production makes it possible to reveal patterns across the values of the controlled parameters which are indicative of process disruptions and to predict their occurrence in the future.

SibDATA 2021: The 2nd Siberian Scientific Workshop on Data Analysis Technologies with Applications 2021, June 25, 2021, Krasnoyarsk, Russia. EMAIL: gglhroom@gmail.com (A. Korobko); metus@icm.krasn.ru (A. Metus). ORCID: 0000-0001-5337-3247 (A. Korobko); 0000-0003-0547-5999 (A. Metus); 0000-0002-0057-0535 (T. Penkova). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).
This paper presents the results of using association rule mining to predict “spike”-type process disruptions based on the daily average data from a series of reduction cells in the experimental area of the Sayanogorsk Aluminum Smelter, monitored from 01.01.2019 to 30.11.2020. At the preprocessing stage, the data were binarized according to the ranges the process parameters fell into, with the ranges defined by various criteria. Prediction models were built as sets of association rules. Finally, quality metrics were used to determine the optimal model settings, and the validation results are demonstrated.

===2. Data preprocessing: binarization===
The inputs from which the association rules are built are binarized (or categorized) daily average values of the controlled parameters, which yield a set of elements characterizing the state of the process on a given day.

Binarization is performed in two steps. Firstly, the values of each parameter are grouped into non-overlapping ranges. Secondly, a binary matrix is formed by classifying the data according to the ranges they belong to. In the resulting matrix, the columns represent the parameter ranges and each row holds 1 or 0, depending on whether the parameter value falls into the range in question.

The ranges were defined both generally and individually. In the general approach, the data from the reduction cells are considered in combination, and the ranges of the parameter values are found for the production as a whole. In the individual approach, the ranges are determined for each reduction cell separately, taking into account its specific properties.

The criteria used to sort the values into ranges are as follows: statistical norms (STDDEV), quartiles (QUARTILES), and ranges indicative of disruptions (HISTOGRAMS) [8].

To perform binarization based on the statistical norm (STDDEV), one needs to determine the standard deviation for a given dataset.
The statistical norm is taken as the interval <math>\mu \pm \sigma</math>, where <math>\mu</math> is the mean of the dataset and <math>\sigma</math> is its standard deviation. It forms three ranges: <math>[x_{min}; x_1)</math>, <math>[x_1; x_2]</math>, <math>(x_2; x_{max}]</math>, where <math>x_{min}</math> is the minimum value in the dataset, <math>x_1 = \mu - \sigma</math> is the lower bound of the norm, <math>x_2 = \mu + \sigma</math> is the upper bound of the norm, and <math>x_{max}</math> is the maximum value in the dataset.

Binarization by means of quartiles (QUARTILES) is made by analyzing the number of values in the dataset. It forms four ranges: <math>[x_{min}; Q_1)</math>, <math>[Q_1; Q_2)</math>, <math>[Q_2; Q_3)</math>, <math>[Q_3; x_{max}]</math>, where <math>x_{min}</math> is the minimum value in the dataset, <math>Q_1</math> is the first quartile, <math>Q_2</math> is the second quartile, <math>Q_3</math> is the third quartile, and <math>x_{max}</math> is the maximum value in the dataset.

To binarize data based on the ranges that are most indicative of the process disruption (HISTOGRAMS), the dataset is broken into two subsets: one contains the days when disruptions occurred, the other the days free of disruptions. The disruption days include the day a disruption was reported and the five preceding days. The values are distributed across non-overlapping ranges on the basis of the frequency distribution of each subset, in order to identify the areas where one subset consistently dominates the other. The number of ranges in this case varies: <math>[x_{min}; x_1)</math>, <math>[x_1; x_2)</math>, …, <math>[x_n; x_{max}]</math>, where <math>x_{min}</math> is the minimum value in the dataset, <math>x_{max}</math> is the maximum value in the dataset, and <math>x_1, x_2, \ldots, x_n</math> denote the boundaries of the intermediate ranges. This method of binarization is reasonable when two or more ranges need to be found.

As a result of binarization, each process parameter is associated with several binary parameters. The resulting ranges make up a binary matrix which is then used to find association rules.

===3. Building a prediction model===
An association rule is a statement of the form “If X, then Y”, which is interpreted as a cause-and-effect relationship, where X is the left-hand part of the rule (the antecedent) and Y is the right-hand part (the consequent).
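The STDDEV and QUARTILES binarization schemes from Section 2 can be illustrated with a minimal sketch. It is not the authors' implementation: the sample values are hypothetical, and the use of the population standard deviation and of the "inclusive" quantile interpolation are assumptions made only for this example.

```python
from statistics import mean, pstdev, quantiles


def binarize_stddev(values):
    """Rows of 0/1 flags for [x_min; x1), [x1; x2], (x2; x_max]."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    x1, x2 = mu - sigma, mu + sigma
    return [[int(v < x1), int(x1 <= v <= x2), int(v > x2)] for v in values]


def binarize_quartiles(values):
    """Rows of 0/1 flags for [x_min; Q1), [Q1; Q2), [Q2; Q3), [Q3; x_max]."""
    # The "inclusive" interpolation method is an assumption of this sketch.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return [
        [int(v < q1), int(q1 <= v < q2), int(q2 <= v < q3), int(v >= q3)]
        for v in values
    ]


# Hypothetical daily averages of one process parameter for one reduction cell.
values = [4.10, 4.15, 4.20, 4.25, 4.30, 4.35, 4.40, 4.95]

stddev_rows = binarize_stddev(values)
quartile_rows = binarize_quartiles(values)

# Each row has exactly one 1: the range the day's value falls into.
for value, row in zip(values, quartile_rows):
    print(value, row)
```

Stacking such rows for every controlled parameter side by side gives the binary matrix in which each column is one parameter range and each row is one day.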
The association rules are calculated in two main steps: 1) searching the data for patterns (itemsets) that appear with a pre-set frequency; 2) composing rules from the itemsets found. To identify how frequently itemsets appear in the data, two criteria are used: support and confidence. Support indicates how frequently a given pattern shows up in the full set. Confidence indicates how frequently the antecedent and the consequent occur together. Another metric, lift, measures significance: it is used to check whether the “if-then” statement holds, or, in other words, how much the event on the right side of the rule results from that on the left. If rules are drawn from all possible itemsets, there may be too many of them, so the calculation is constrained by a threshold on one of the coefficients.

In contrast to traditional algorithms for generating association rules, which identify various atypical patterns, the suggested model identifies the telltale signs of emerging deviations in the operation process, considering only the days which preceded the day the process disruption was found (i.e. the time when the disruption had only started to form).

The rule-generating algorithm is based on concept lattice theory (formal concept analysis) [9, 10]. Formal concepts are derived from the resulting binary matrix. A formal concept is a pair of sets (A, B), where B is a combination of binary parameters which partially or fully describes the day before the disruption was discovered. The key element of B is the target parameter, which stands on the right side of the rule, while the rest of the binary parameters form the left side. The element A is the set of events covered by the created rule. The size of the set |A| allows calculating the rule support coefficient by relating it to the size of the input set of events (with both favorable and undesired outcomes).
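The support, confidence and lift metrics introduced above can be illustrated on a small binary matrix. The sketch below uses the standard textbook definitions, which are an assumption here and may differ in detail from the coefficients the authors derive from formal concepts; the column names and the data are hypothetical.

```python
def support(rows, items):
    """Share of rows (days) in which every column in `items` equals 1."""
    hits = sum(1 for row in rows if all(row[col] == 1 for col in items))
    return hits / len(rows)


def rule_metrics(rows, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    supp_rule = support(rows, set(antecedent) | {consequent})
    supp_ante = support(rows, antecedent)
    supp_cons = support(rows, {consequent})
    confidence = supp_rule / supp_ante if supp_ante else 0.0
    lift = confidence / supp_cons if supp_cons else 0.0
    return supp_rule, confidence, lift


# Hypothetical binary matrix: two parameter ranges plus the target column.
rows = [
    {"temp_q4": 1, "volt_q1": 1, "spike": 1},
    {"temp_q4": 1, "volt_q1": 1, "spike": 1},
    {"temp_q4": 1, "volt_q1": 0, "spike": 0},
    {"temp_q4": 0, "volt_q1": 1, "spike": 0},
    {"temp_q4": 1, "volt_q1": 1, "spike": 0},
    {"temp_q4": 0, "volt_q1": 0, "spike": 0},
    {"temp_q4": 0, "volt_q1": 0, "spike": 1},
    {"temp_q4": 0, "volt_q1": 0, "spike": 0},
]

supp, conf, lift = rule_metrics(rows, {"temp_q4", "volt_q1"}, "spike")
print(supp, conf, lift)
```

In this toy data the rule covers two of the eight days, holds in two of the three days where the antecedent is present, and has a lift above 1, meaning the consequent is more frequent when the antecedent is observed than in the data overall.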
The confidence coefficient is determined by relating the rule support coefficient to the frequency of the consequent taken separately from the antecedent. Both coefficients help evaluate the strength of the rule (lift). In this regard, rules with high confidence and low support indicate a rare combination of parameters which is seen in only one or two events. Conversely, rules with low confidence and high support describe a frequent pattern which is found both on disruption days and on disruption-free days.

Figure 1 shows a scatter plot of the rules obtained for Reduction Cell No. 6 in the Experimental Area РА-550. The training set included the data from 2019, when 10 “spike”-type disruptions were detected in the reduction cell. The algorithm with STDDEV binarization generated 577 association rules. In the graph, the set of rules with the highest confidence is highlighted in yellow.

Figure 1: The scatter plot for the rules in Reduction Cell No. 6 in the Experimental Area РА-550, 2019 (plotted quantities: support, confidence, cardinality)

The association rules based on the analysis of the preceding events make it possible to determine the ranges of values and combinations of the controlled parameters which describe the state of the process before the disruption was detected, considering the specific characteristics of the reduction cells. The resulting set of rules can be used to predict process disruptions and evaluate the probability of an unfavorable event.

===4. Validating and adjusting the prediction model===
The predictive accuracy of the model is determined using the following indicator:

<math>accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \qquad (1)</math>

where TP is the number of true positive outcomes, FP is the number of false positive outcomes, FN is the number of false negative outcomes, and TN is the number of true negative outcomes.
At the same time, the percentage of True Positives is found as the ratio of the TP outcomes to the number of entries which correspond to the days prior to disruptions, while the percentage of False Positives is defined as the ratio of the FP outcomes to the number of entries which describe the disruption-free days (with respect to the current day and the one that follows).

The main parameters used in the model tuning are the following:
* type of the rule formation (individual or general approach): ind_type = (true, false);
* type of binarization: bin_type = (STDDEV, QUARTILES, HISTOGRAMS);
* data completeness (with or without the consideration of empty values): keep_nan = (true, false);
* range of values (with or without the consideration of the “routine” parameter values): anomalies_only = (true, false).

The model was validated on the monitoring data from Reduction Cell No. 7 in the Experimental Area РА-550. The training dataset includes entries for the period 2019.01.01-2020.01.01, namely 730 entries and 13 “spike”-type disruptions. The test dataset comprises entries for the period 2020.01.01-2020.05.01, namely 121 entries and 18 “spike”-type disruptions. The results of the model validation with different tuning parameters are presented below.

Model 1: ind_type = true; bin_type = STDDEV; keep_nan = true; anomalies_only = true. Number of rules: 101. The model performance is shown in Figure 2. The graph on the left shows how the percentage of true and false positive predictions changes with the confidence threshold value; the diagram on the right compares the prediction results with the actual process disruptions. The maximum accuracy of the model amounts to 0.85 (i.e. 85% of true predictions) at the confidence threshold value of 0.61, with TP/FN = 8/10 and TN/FP = 95/8.
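The accuracy indicator (1) can be checked against the counts reported for Model 1 with a short sketch. The function name and the returned keys are hypothetical, and computing the false-positive share as FP divided by all disruption-free entries (FP + TN) is an assumption about how that ratio is meant.

```python
def evaluate(tp, fn, tn, fp):
    """Quality indicators computed from the four outcome counts."""
    total = tp + tn + fp + fn
    return {
        # Indicator (1): share of correct predictions among all entries.
        "accuracy": (tp + tn) / total,
        # TP related to all entries preceding a disruption.
        "true_positive_share": tp / (tp + fn),
        # FP related to all disruption-free entries (an assumption).
        "false_positive_share": fp / (fp + tn),
    }


# Counts reported for Model 1: TP/FN = 8/10, TN/FP = 95/8 (121 test entries).
model_1 = evaluate(tp=8, fn=10, tn=95, fp=8)
print(round(model_1["accuracy"], 2))  # 0.85
```

The four counts add up to the 121 test entries, and TP + FN equals the 18 reported disruptions, so the rounded value agrees with the accuracy stated for Model 1.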
Changing the tuning parameters to ind_type = false and anomalies_only = false decreases the overall accuracy and results in a higher number of false positive predictions. In that case the maximum accuracy is achieved by increasing the number of true negative outcomes: TN = 99, TP = 0.

Figure 2: The performance of Model 1 (series: true-positive, false-positive; axes: proportion, confidence)

Model 2: ind_type = true; bin_type = QUARTILES; keep_nan = true; anomalies_only = true. Number of rules: 557. Figure 3 shows the model performance. Its maximum accuracy is 0.88 at the threshold value of 0.41, with TP/FN = 13/5 and TN/FP = 93/10. At confidence threshold values within the range of 0.41-0.42, the percentage of true positive predictions amounts to 72%, while that of false positives is equal to 11%. When the tuning parameters are changed to ind_type = false and anomalies_only = false, the overall accuracy drops and the number of false positive predictions grows.

Figure 3: The performance of Model 2 (series: true-positive, false-positive; axes: proportion, confidence)

Model 3: ind_type = true; bin_type = HISTOGRAMS; keep_nan = true; anomalies_only = true. Number of rules: 270. The performance of the model is shown in Figure 4. Its accuracy reaches 0.86 at the confidence threshold of 0.51, with TP/FN = 1/17 and TN/FP = 103/0. The performance results suggest that the prediction error is rather high, while the estimates of the true positive and false positive predictions prove to be similar. The high confidence of a rule may indicate a disruption which is currently developing and thus has not been detected yet.

Figure 4: The performance of Model 3 (series: true-positive, false-positive; axes: proportion, confidence)

Table 1 shows the results of testing the model with different types of data binarization.
It presents the dates preceding the days on which the “spike”-type disruptions were reported, and the corresponding confidence of the operating rule, which is interpreted as the probability of a disruption occurring if the controlled parameters show the values as observed.

The validation results show that the predictive accuracy of the model is higher with the individual approach. When selecting the parameters at the model tuning stage, it is important to consider the data completeness and to look at the entire range of values of the process parameters. To a great extent, the accuracy of the given model depends on the period covered by the training set. The conditions of the technology process change over time, and so do the ranges of values and combinations of the parameters characterizing the occurring disruptions, which inevitably affects the results of their prediction.

Among the models with different types of binarization, the best results were demonstrated by the one with quartile-based binarization. Therefore, the model with the following parameters was chosen to be implemented: ind_type = true; keep_nan = true; anomalies_only = true; bin_type = QUARTILES.

{| class="wikitable"
|+ Table 1: The results of testing the association rule
! Date !! STDDEV !! HISTOGRAMS !! QUARTILES
|-
| 14.08.2020 || 0.545 || 0.5 || 1
|-
| 18.08.2020 || 0.375 || 0.666 || 0.75
|-
| 23.08.2020 || 0 || 0.417 || 0.625
|-
| 04.09.2020 || 1 || 0 || 1
|-
| 07.09.2020 || 0.375 || 1 || 1
|-
| 11.09.2020 || 1 || 0.429 || 1
|-
| 09.10.2020 || 0 || 1 || 0.6
|-
| 11.10.2020 || 0 || 0.615 || 1
|-
| 13.10.2020 || 0 || 0.412 || 0.385
|-
| 31.10.2020 || 0.429 || 0 || 1
|-
| 04.11.2020 || 0 || 0 || 0.5
|-
| 10.11.2020 || 0.375 || 0 || 1
|-
| 14.11.2020 || 0.5 || 0.429 || 1
|-
| 15.11.2020 || 0 || 0.5 || 1
|-
| 18.11.2020 || 1 || 1 || 1
|-
| 19.11.2020 || 0.625 || 0.833 || 1
|-
| 22.11.2020 || 0.4 || 0.5 || 1
|-
| 23.11.2020 || 0.714 || 0.5 || 1
|}

===5. Conclusion===
The article presents the results of applying association rule mining to predict “Anode spike”-type process disruptions on the basis of daily average monitoring data from a series of reduction cells in the experimental area of the Sayanogorsk Aluminum Smelter.
At the preprocessing stage, the data were binarized using various criteria to divide the values of the process parameters into ranges: statistical norms, quartiles, and ranges indicative of disruptions. The predictive models were built as sets of association rules. The testing results suggest that the predictive accuracy is higher with the individual approach. Moreover, it is crucial to take into account the completeness of the data and to consider the entire range of values of the process parameters. The accuracy of the model proved to depend largely on the period covered by the training set, because the conditions in which the process runs change considerably over time. The model with quartile-based binarization was selected for implementation. The validation results indicate that the quality of the model is sufficient for practical use. Further research into the monitoring inputs is required to obtain a higher predictive accuracy.

===6. References===
* [1] Yu. G. Mikhalev, P. V. Polyakov, A. S. Yasinsky, S. G. Shakhrai, A. I. Bezrukikh, A. V. Zavadyak, Causes of Process Disruptions Involving Anodes. Review of Russian and Overseas Experimental Data, SFU Journal. Engineering and Technologies 10(5) (2017) 593–606.
* [2] J. Treinen, T. Ramakrishna, A Framework for the Application of Association Rule Mining in Large Intrusion Detection Infrastructures, in: Proceedings 9th International Symposium on Recent Advances in Intrusion Detection, LNCS, volume 4219, 2006, pp. 1–18.
* [3] J. Jeon, S. Y. Sohn, Product failure pattern analysis from warranty data using association rule and Weibull regression analysis: A case study, Reliability Engineering and System Safety 133 (2015) 176–183.
* [4] J. Kim, H. Hwangbo, Real-Time Early Warning System for Sustainable and Intelligent Plastic Film Manufacturing, Sustainability 11(5) (2019) 1490.
* [5] K. Wongwan, W. Laosiritaworn, Application of Association Rules in Woven Wire Mesh Defects Analysis, in: 7th International Conference on Industrial Technology and Management, 2018, pp. 325–329.
* [6] H. T. Hu, R. Z. Zhang, X. Guan, Application on Crude Oil Output Forecasting Based on TB-SCM Algorithm, in: Proceedings 5th International Conference on Electronics Information and Emergency Communication (ICEIEC), 2015, pp. 398–401.
* [7] J. Kim, Y. Lee, Progress of Technological Innovation of the United States’ Shale Petroleum Industry Based on Patent Data Association Rules, Sustainability 12(16) (2020) 6628.
* [8] A. Metus, T. Penkova, Analysis of Aluminium Electrolysis Data in the Context of Extreme Values of Technological Parameters, in: CEUR Workshop Proceedings, volume 2727, 2020, pp. 92–98.
* [9] B. Ganter, R. Wille, Formal Concept Analysis: Mathematical Foundations, Springer-Verlag, Berlin Heidelberg, New York, 1999.
* [10] R. Wille, Restructuring Lattice Theory: an approach based on hierarchies of concept, Reidel, Dordrecht-Boston, 1982.