Transparently mining data from a medium-voltage distribution network: a prognostic-diagnostic analysis

Matteo Nisi, Department of Electronics and Telecommunications, Politecnico di Torino, Italy, m.nisi@studenti.polito.it
Daniela Renga, Department of Electronics and Telecommunications, Politecnico di Torino, Italy, daniela.renga@polito.it
Daniele Apiletti, Department of Control and Computer Engineering, Politecnico di Torino, Italy, daniele.apiletti@polito.it
Danilo Giordano, Department of Control and Computer Engineering, Politecnico di Torino, Italy, danilo.giordano@polito.it
Tao Huang, Department of Energy, Politecnico di Torino, Italy, tao.huang@polito.it
Yang Zhang, Department of Energy, Politecnico di Torino, Italy, yang.zhang@polito.it
Marco Mellia, Department of Electronics and Telecommunications, Politecnico di Torino, Italy, marco.mellia@polito.it
Elena Baralis, Department of Control and Computer Engineering, Politecnico di Torino, Italy, elena.baralis@polito.it

ABSTRACT
With the shift from the traditional electric grid to the smart grid paradigm, huge amounts of data are collected during system operations. Data analytics becomes of fundamental importance in power networks to enable predictive maintenance, to perform effective diagnosis, and to reduce related expenditures. The final goal is to improve the electric service efficiency and reliability to the benefit of both the citizens and the grid operators themselves. This paper considers a dataset collected over 6 years in a real-world medium-voltage distribution network by the Supervisory Control And Data Acquisition (SCADA) system. A transparent, exploratory, and exhaustive data-mining approach, based on association rule extraction, is applied to automatically identify correlations among SCADA events occurring before and after specific service interruptions, i.e., distribution network faults of interest. Both the prognostic and the diagnostic potentials of the dataset are thus investigated with respect to the occurrence of permanent service interruptions. Our results highlight a limited predictive capability of the available set of SCADA events, while the same events can be effectively exploited for diagnostic purposes.

1 INTRODUCTION
Electric grid operators welcome predictive maintenance to avoid the costs of scheduled inspections and reactive maintenance interventions. To this aim, datasets describing the electric grid operations, with historical data about failures and alarm signals, are under investigation. Although this data has been collected for different purposes, companies are interested in determining its predictive maintenance capability: to reduce management costs, to speed up intervention time, and to improve efficiency and reliability.

For our study, we rely on a large dataset spanning 6 years, collected by a leading Italian electric grid operator. The dataset describes the operations of a medium-voltage distribution network in northeastern Italy, and it records events and failures through the Supervisory Control And Data Acquisition (SCADA) system. Our aim is to assess whether this dataset could be exploited to (i) predict future electric network failures (predictive maintenance) and/or (ii) effectively diagnose the failures after they are reported by the maintenance system. Since the predictive capability of such a dataset, and its capability to model system degradation, are unknown, we address the predictive task by means of an exploratory predictive maintenance analysis. To this aim, two exploratory approaches are applied: a statistical data characterisation approach, and a transparent exhaustive method based on association rule mining. The latter automatically extracts all correlations, above specific thresholds, among SCADA events occurring before each fault of interest (prognostic), and separately, after the faults (diagnostic). Quality metrics are exploited to highlight the most meaningful correlations. Finally, human-readable patterns describing such correlations are investigated.

To the best of our knowledge, our work is the first study that investigates both the prognostic and diagnostic capabilities of a real-world historical dataset collected by a SCADA system in an electric grid, with respect to the occurrence of severe service interruptions. By applying an exhaustive analysis methodology that extracts association rules among faults and events, we provide smart grid operators with an assessment of the exploitation potential of currently available datasets for predictive maintenance and diagnosis. The proposed methodology can be applied to similar datasets from any grid operator.

2 DATASET
The dataset under analysis contains events recorded by the SCADA system of a leading Italian grid operator on its medium-voltage distribution network. The dataset was recorded over a period of 6 years (2010-2016), covering two northeastern Italian regions (Veneto and Friuli-Venezia-Giulia). The dataset is characterised by 3,901 faults of interest, 30 different affected components, and 153,094 general SCADA events of network operations. The SCADA events are divided into 67 different event types, with the generic failure event type accounting for 79,833 events. The faults of interest correspond to those: (i) lasting more than 180 seconds, (ii) with the location in the network identified, and (iii) with the cause determined. These events are named Permanent Service Interruptions (SIPs), tagged with a cause among 45 different reasons and linked to one among the 30 affected components.

We briefly characterise the dataset by analysing the distribution of SIP causes and types of SCADA events.
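As an illustration of the selection criteria (i)-(iii) and of the frequency characterisation of SIP causes, the following minimal Python sketch filters a handful of made-up fault records and computes the relative frequency of each cause. The record layout (`Fault`, `duration_s`, `location`, `cause`) and the sample values are hypothetical and are not taken from the operator's dataset; only the 180-second threshold comes from the text.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

MIN_SIP_DURATION_S = 180  # criterion (i): fault lasts more than 180 seconds


@dataclass
class Fault:
    """Hypothetical fault record; field names are illustrative only."""
    duration_s: float
    location: Optional[str]  # None when the location was not identified
    cause: Optional[str]     # None when the cause was not determined


def is_sip(fault: Fault) -> bool:
    """Apply criteria (i)-(iii) for a Permanent Service Interruption."""
    return (
        fault.duration_s > MIN_SIP_DURATION_S
        and fault.location is not None
        and fault.cause is not None
    )


def cause_frequencies(sips):
    """Relative frequency of each cause among the selected SIPs."""
    counts = Counter(f.cause for f in sips)
    total = sum(counts.values())
    return {cause: n / total for cause, n in counts.items()}


# Made-up sample records, for illustration only.
faults = [
    Fault(600, "feeder_12", "electric fault"),
    Fault(900, "feeder_07", "electric fault"),
    Fault(400, "feeder_03", "wind"),
    Fault(120, "feeder_12", "wind"),           # too short
    Fault(900, None, "other causes"),          # location not identified
    Fault(300, "feeder_03", None),             # cause not determined
    Fault(180, "feeder_03", "electric fault"), # exactly 180 s, not "more than"
]

sips = [f for f in faults if is_sip(f)]
frequencies = cause_frequencies(sips)
```

On real data the same two steps would be applied to the full fault log before plotting the distribution of causes.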
Figure 1a reports the probability distribution of the most frequent causes of SIPs among the 45 available: the top 4 causes account for 75% of the SIPs, with “electric fault” being the most frequent cause (45%). More than 20% of SIPs are due to natural causes, such as weather issues, plant falls, snow overload, wind, and animal contact. These causes cannot be predicted without contextual knowledge beyond the operational events of the electrical grid. Furthermore, another 20% of SIPs are due to unknown “other causes” (second most frequent value).

Figure 1b reports the probability distribution of the most common SCADA event types. The distribution is skewed, with about 75% of SCADA events belonging to just 6 different types, and the most frequent type having a frequency above 30%.

Figure 1: Frequency distribution of the values of (a) causes of faults (SIPs) and (b) types of SCADA events.

3 PROGNOSTIC-DIAGNOSTIC APPROACH
Since this work aims at investigating both the prognostic and diagnostic potential of SCADA events with respect to SIPs, we focus on the analysis of those events occurring both before and after a SIP, in the same portion of the network, under the assumption that the time and space correlations might capture causalities of the system.

3.1 Pre-Fault and After-Fault Windows
In the time dimension, we define a time window preceding the occurrence of a SIP, denoted as Pre-Fault Window (PFW), and a time window immediately following the SIP, denoted as After-Fault Window (AFW). In the space dimension, we consider only SCADA events observed in the same portion of the network where the SIP occurs, i.e., reported by the same feeder as origin of the collected data, since according to the domain experts they are more likely to be correlated to the considered SIP.

Considering that the grid operator is interested in predicting future SIPs occurring within the next month at most, the time windows are defined with the following variable lengths: 1, 7, or 30 days for the PFW, and 1 hour, 1 day, or 7 days for the AFW. These values result from wider preliminary analyses, with the aim of capturing behaviours of the distribution network at different time scales of interest for domain experts of the electric grid company.

Figure 2: Complementary Cumulative Distribution Function (CCDF) of the number of SCADA events registered during various lengths of PFWs and AFWs. [Plot: CCDF versus number of SCADA events; curves for PFW of 30, 7, and 1 days and for AFW of 1 hour, 1 day, and 7 days.]

Figure 2 reports the Complementary Cumulative Distribution Function (CCDF) of the number of SCADA events registered during the PFWs (continuous curves) and the AFWs (dotted curves). Comparing the CCDFs of PFWs and AFWs, in almost 90% of the AFWs at least one SCADA event is observed, even within 1 hour; instead, in 50% of the 7-day PFWs and in 60% of the 1-day PFWs, no SCADA events are registered at all. Furthermore, PFW curves show a more gradual descent with respect to the AFW curves: SCADA events are more likely to follow a SIP than to precede the fault of interest. This data-driven intuition is also confirmed by domain knowledge: many types of SCADA events are known to be triggered by a SIP.

Finally, the 1-hour AFW curve shows a steeper descent than the longer-lasting AFWs, but with the same starting (leftmost) values: most SCADA events are typically observed within the first hour after a SIP, and then few events are collected after 1 or 7 days. On the contrary, the curves of the 7-day PFW and the 30-day PFW show larger differences, since few events are collected in the days immediately preceding a SIP. Most SCADA events occurring before a SIP are registered in the previous 1-7 days. Although few additional events are observed when considering a 30-day PFW, we also note that a higher number of SCADA events in the PFW correlates with a higher probability of registering another non-permanent service interruption during the same PFW (results omitted due to space limitations, partially discussed in Section 3.2), so a significant portion of the 30-day PFW events could ideally be associated with the AFWs of those minor service interruptions.

Overall, these considerations suggest a limited prognostic potential of the SCADA events with respect to SIPs: the events preceding a SIP are fewer, more loosely related in time, and spread over a high variety of SCADA event types. Conversely, the diagnostic exploitation seems better supported by more data, nearer to the event of interest.

3.2 Inter-Fault Window
We define the Inter-Fault Window (IFW) as the time interval between two consecutive faults on the same portion of the network. The aim of this analysis is to determine how many events following a SIP, i.e., in its AFW and inherently diagnostic, are also included in a PFW before another SIP, thus being modelled also as prognostic features. Both SIPs and other minor Service Interruptions generate diagnostic SCADA events in their AFWs, hence different IFWs can be defined, depending on the type of faults considered (SIPs only or all Service Interruptions). Figure 3 shows the probability distribution of the duration of two types of IFWs:
• Case A (dotted green curve): IFW between each pair of consecutive SIPs.
• Case B (continuous red curve): IFW between each registered SIP and the immediately preceding Service Interruption of any type (either SIP or not).

Figure 3: IFW lengths of various types of faults. [Plot: frequency versus time interval in days, from 0 to 100; curves for Case A and Case B.]

In 80% of cases, the IFW between two consecutive SIPs lasts more than 40 days, and there is only a 7% probability that two SIPs are separated by an interval of less than 7 days (Case A). Hence, with a 7-day PFW, we limit the interference of AFWs of other SIPs into the PFW of the current SIP under analysis, so that prognostic and diagnostic events are kept separate for different SIPs.

However, in Case B, the IFW between a SIP and the immediately preceding Service Interruption lasts up to 30 days in almost 60% of the cases, and the probability of having an IFW shorter than 7 days rises to 26%, more than three times that of Case A. Hence, there exist SCADA events registered during a PFW preceding a SIP that are generated as a consequence, i.e., in the AFW, of a previously occurring Service Interruption.

3.3 Challenges
From the time-window-based data characterisation, the following takeaways can be identified:
• 60% of the SIPs have no SCADA events in their 7-day PFW.
• 10% of the SIPs have no SCADA events in their 1-day AFW.
• Most diagnostic events occur in the 1-hour AFW.
• Many apparently-prognostic events occur more than 1 week before the SIP (PFW); however, they include events generated as a consequence of other minor faults, i.e., they are in the AFW of non-permanent Service Interruptions, in 60% of the cases for a 30-day PFW, and in 26% of cases for a 7-day PFW.

4 RULE MINING
To address the challenges identified in Section 3.3, we exploited a transparent, exhaustive and exploratory data mining approach: association rule mining. The technique and its evaluation metrics, as required by the scope of the current work, are defined as follows.

4.1 Association Rule Extraction
Let D be a dataset whose generic record r consists of a set of co-occurring events, i.e., events that occur in the same time window. Each event, also called item, is a couple (attribute, value). In the current work, the attribute is either a SCADA event type, or an alleged cause, or a failed component, and the value is 1 if that attribute is true in the time window under exam (e.g., the SCADA event is present, the component failed, or the specific cause was determined), or 0 otherwise. Note that a SCADA event might represent another SIP or a minor fault occurring before or after the analysed SIP. An itemset I is a set of co-occurring events, failed components, and alleged causes among the records r in the dataset D. Such a set of items I in a PFW or, separately, in an AFW constitutes the input feature vector of the rule mining extraction. The support count of an itemset I is the number of records r containing I. The support s(I) of an itemset I is the percentage of records r containing I with respect to the total number of records r in the full dataset D. An itemset is frequent when its support is greater than or equal to a minimum support threshold MinSup.

Association rule mining aims at identifying collections of itemsets (i.e., sets of co-occurring events) that are frequently present in the dataset under analysis, according to statistically relevant metrics. The extracted rules are all and only those adhering to the thresholds of statistical relevance defined as parameters of the mining process. The approach is therefore exhaustive, and thus a powerful exploratory tool, within the boundaries of the problem formulation (i.e., itemset definition and threshold settings).

Association rules are usually represented in the form X → Y, where X (rule antecedent) and Y (rule consequent) are disjoint itemsets (i.e., they include different attributes). To identify the most meaningful rules among those extracted by the mining process, quality measures can be exploited as ranking criteria. The following popular quality measures are used in the current work: rule support, confidence, and lift. Rule support s(X → Y) is the percentage of records containing both X and Y. It represents the prior probability of X ∪ Y, i.e., the support s(X ∪ Y) of the corresponding itemset I = X ∪ Y in the dataset. Rule confidence is the conditional probability of finding Y given X. It describes the strength of the implication and is given by $c(X \rightarrow Y) = \frac{s(X \cup Y)}{s(X)}$ [5].

All and only association rules with support and confidence above (or equal to) a support threshold MinSup and a confidence threshold MinConf are to be extracted. Among those surviving the thresholds, a ranking based on descending support, confidence and lift values can focus the attention on the most statistically relevant patterns. The lift [5] of a rule X → Y measures the (symmetric) correlation between antecedent and consequent, and it is defined as follows.

$$\mathrm{lift}(X, Y) = \frac{c(X \rightarrow Y)}{s(Y)} = \frac{s(X \rightarrow Y)}{s(X) \cdot s(Y)} \qquad (1)$$

In Equation (1), c(X → Y) and s(X → Y) are the rule confidence and support; s(X) and s(Y) are the supports of the rule antecedent and consequent, respectively. If lift(X, Y) = 1, itemsets X and Y are not correlated, i.e., they are statistically independent. Lift values below 1 show a negative correlation between itemsets X and Y, while values above 1 indicate a positive correlation, with higher lift indicating stronger rules, hence typically more meaningful and interesting correlations.

4.2 Rule quality analysis
The analysis of the extracted rules has been performed for various parameter values. Due to space constraints, we report only the most meaningful results based on the rules obtained by (i) setting MinSup to 0.02, then focusing on rules (ii) whose lift is higher than 1.5, and (iii) having a cause or component as conclusion.

The rules resulting from this selection are reported in Figures 4a-4b for a 7-day PFW. They are scatter-plotted according to support, confidence and lift values. For comparison, the same results are reported in Figures 4c-4d for an AFW of 1 day. The diagnostic potential (AFW) is confirmed by a larger number of correlations with better quality metrics with respect to the prognostic capability (PFW):
• 45 rules extracted in the AFW vs 3 in the PFW.
• 50% max rule confidence in AFW vs 25% in PFW.
• 2.73 max lift value in AFW vs 1.9 in PFW.
• 8% max support in AFW vs 4.5% in PFW.

Finally, the top rules according to lift, confidence and support have been inspected by domain experts from the grid company, enabling a transparent evaluation of the correlation model and of the prognostic-diagnostic approach.

Figure 4: Association rules extracted from the 7-day PFW (a-b) and from the 1-day AFW (c-d), with causes or components as conclusion (x-axis in log scale). [Panels: (a) PFW: Confidence vs lift−1; (b) PFW: Support vs lift−1; (c) AFW: Confidence vs lift−1; (d) AFW: Support vs lift−1.]

5 RELATED WORK
With the shift from the traditional electric grid to the Smart Grid paradigm, data analytics and related applications are becoming of fundamental importance in power networks, as shown by the several studies available in the literature focusing on this topic [6, 9]. However, few research efforts have been specifically devoted to predictive maintenance. Some studies aim at performing fault detection in power networks, based on historical weather data mining [7], on extreme learning machine models [10], or on electrical feature extraction techniques [2]. Authors in [4] deploy an effective method to detect faults in smart grids, trading off the need to reduce the huge volume of data collected by Phasor Measurement Units against the need to keep critical information. Other studies aim not only to detect faults, but also to further characterise them by identifying and exploiting significant features. Classifiers based on clustering and dissimilarity learning techniques [3] or on feature extraction algorithms [1] are used to analyse massive data to perform fault recognition or distribution fault diagnosis. The deployment of fault detection methods with prognostic purposes is not well investigated in the literature. Authors in [8] aim at reducing the outages in Medium Voltage distribution networks by exploiting rule-based, data mining and clustering techniques to design a method providing diagnostic and prognostic functions for Distribution Automation systems.

6 CONCLUSIONS
This work analysed 6 years of data recorded from a medium-voltage distribution network, with the purpose of estimating both the prognostic and diagnostic potential for severe faults, i.e., permanent service interruptions. Time-window data characterisation and exhaustive rule-mining results confirm the capability of the collected data to support diagnostic tasks, whereas their prognostic potential is limited, since only few and poor predictive correlations are present in the data. Future work includes wider analyses of the rules for different thresholds, and changes to the transactional dataset derived from the raw data to enable the extraction of additional correlations. Finally, further investigations of the predictive capability will be performed by testing the effectiveness of the obtained rules in detecting actual failures.

ACKNOWLEDGMENT
The research leading to these results has been funded by Enel Italia, e-distribuzione, and the SmartData@PoliTO center for Data Science technologies and applications.

DARLI-AP 2019. © 2019 Copyright held by the author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal), on CEUR-WS.org.

REFERENCES
[1] Y. Cai and M. Chow. 2009. Exploratory analysis of massive data for distribution fault diagnosis in smart grids. In 2009 IEEE Power Energy Society General Meeting. 1–6.
[2] Q. Cui, K. El-Arroudi, and G. Joos. 2017. An effective feature extraction method in pattern recognition based high impedance fault detection. In 2017 19th International Conference on Intelligent System Application to Power Systems (ISAP). 1–6.
[3] Enrico De Santis, Lorenzo Livi, Alireza Sadeghian, and Antonello Rizzi. 2015. Modeling and Recognition of Smart Grid Faults by a Combined Approach of Dissimilarity Learning and One-class Classification. Neurocomput. 170, C (Dec. 2015), 368–383.
[4] Huaiguang Jiang, Xiaoxiao Dai, Wenzhong Gao, Jun Zhang, Yingchen Zhang, and Eduard Muljadi. 2016. Spatial-Temporal Synchrophasor Data Characterization and Analytics in Smart Grid Fault Detection, Identification and Impact Causal Analysis. IEEE Transactions on Smart Grid 7 (09 2016), 1–1.
[5] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. 2005. Introduction to Data Mining. Addison-Wesley.
[6] Chunming Tu, Xi He, Zhikang Shuai, and Fei Jiang. 2017. Big data issues in smart grid – A review. Renewable and Sustainable Energy Reviews 79 (2017), 1099–1107.
[7] Jian Wang. 2016. Early warning method for transmission line galloping based on SVM and AdaBoost bi-level classifiers. IET Generation, Transmission and Distribution 10 (November 2016), 3499–3507(8). Issue 14.
[8] Xiaoyu Wang, Stephen McArthur, Scott Strachan, John D. Kirkwood, and Bruce Paisley. 2017. A Data Analytic Approach to Automatic Fault Diagnosis and Prognosis for Distribution Automation. IEEE Transactions on Smart Grid PP (05 2017), 1–1. https://doi.org/10.1109/TSG.2017.2707107
[9] Yang Zhang, Tao Huang, and Ettore Francesco Bompard. 2018. Big data analytics in smart grids: a review. Energy Informatics 1, 1 (2018), 8.
[10] Y. Zhang, Y. Xu, Z. Y. Dong, Z. Xu, and K. P. Wong. 2017. Intelligent Early Warning of Power System Dynamic Insecurity Risk: Toward Optimal Accuracy-Earliness Tradeoff. IEEE Transactions on Industrial Informatics 13, 5 (Oct 2017), 2544–2554.
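As a supplementary illustration of the windowing of Section 3.1 and of the rule quality measures of Section 4.1, the following Python sketch builds one record per SIP from the events seen on the same feeder in its PFW and AFW, and then computes support, confidence, and lift for a candidate rule. It is a minimal sketch under stated assumptions, not the implementation used in this work: the tuple layouts, the feeder names, the event type names, and the `event=`/`cause=` item prefixes are all hypothetical, and no frequent-itemset mining algorithm is included.

```python
from datetime import datetime, timedelta


def build_records(sips, events, pfw, afw):
    """Return (pfw_records, afw_records), one item set per SIP.

    sips:   list of (time, feeder, cause) tuples (hypothetical layout)
    events: list of (time, feeder, event_type) tuples (hypothetical layout)
    Only events reported by the same feeder as the SIP are considered.
    """
    pfw_records, afw_records = [], []
    for sip_time, feeder, cause in sips:
        label = {f"cause={cause}"}
        before = {
            f"event={etype}"
            for ts, fdr, etype in events
            if fdr == feeder and sip_time - pfw <= ts < sip_time
        }
        after = {
            f"event={etype}"
            for ts, fdr, etype in events
            if fdr == feeder and sip_time < ts <= sip_time + afw
        }
        pfw_records.append(before | label)
        afw_records.append(after | label)
    return pfw_records, afw_records


def support(itemset, records):
    """Fraction of records that contain every item of `itemset`."""
    return sum(1 for r in records if itemset <= r) / len(records)


def confidence(x, y, records):
    """c(X -> Y) = s(X u Y) / s(X); 0.0 when X never occurs."""
    s_x = support(x, records)
    return support(x | y, records) / s_x if s_x > 0 else 0.0


def lift(x, y, records):
    """lift(X, Y) = c(X -> Y) / s(Y); 0.0 when Y never occurs."""
    s_y = support(y, records)
    return confidence(x, y, records) / s_y if s_y > 0 else 0.0


# Made-up data, for illustration only.
t0 = datetime(2015, 1, 10, 12, 0)
t1 = t0 + timedelta(days=20)
sips = [(t0, "F1", "electric fault"), (t1, "F2", "wind")]
events = [
    (t0 - timedelta(days=2), "F1", "earth fault"),
    (t0 + timedelta(minutes=10), "F1", "line opening"),
    (t0 + timedelta(minutes=30), "F1", "protection trip"),
    (t0 + timedelta(minutes=5), "F2", "line opening"),  # other feeder
    (t1 + timedelta(minutes=5), "F2", "line opening"),
]

pfw_records, afw_records = build_records(
    sips, events, pfw=timedelta(days=7), afw=timedelta(hours=1)
)

x, y = {"event=line opening"}, {"cause=wind"}
rule_support = support(x | y, afw_records)
rule_confidence = confidence(x, y, afw_records)
rule_lift = lift(x, y, afw_records)
```

In this toy example the antecedent appears in every AFW record, so the rule has a lift of 1 and carries no information about the cause. With thresholds such as those of Section 4.2, a rule would be retained only if its support reaches MinSup and its lift exceeds 1.5.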