<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Robust Alternative to Correlation Networks for Identifying Faulty Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Patrick Traxler</string-name>
          <email>patrick.traxler@scch.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pablo Gómez</string-name>
          <email>pablo.gomez@faw.jku.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tanja Grill</string-name>
          <email>tanja.grill@scch.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Applied Knowledge Processing, Johannes Kepler University</institution>
          ,
          <addr-line>Linz</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Software Competence Center Hagenberg</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <fpage>11</fpage>
      <lpage>18</lpage>
      <abstract>
        <p>We study the situation in which many systems relate to each other. We show how to robustly learn relations between systems to conduct fault detection and identification (FDI), i.e. the goal is to identify the faulty systems. Towards this, we present a robust alternative to the sample correlation matrix and show how to randomly search in it for a structure appropriate for FDI. Our method applies to situations in which many systems can be faulty simultaneously and thus our method requires an appropriate degree of redundancy. We present experimental results with data arising in photovoltaics and supporting theoretical results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The increasing number of technical systems connected to
the Internet raises new challenges and possibilities in
diagnosis. Large amounts of data need to be processed and
analyzed. Faults need to be detected and identified.
Systems exist in different configurations, e.g. two systems of
the same type that have different sets of sensors.
Knowledge about the system design is often incomplete. Data is
often unavailable due to unreliable data connections.
Besides these and other difficulties, the large amount of data
also opens new possibilities for diagnosis based on machine
learning.</p>
      <p>The idea of our approach is to conduct fault detection and
identification (FDI) by comparing data of similar systems.
We assume we have data from machines, devices, or systems of a
similar type and want to know if some system is faulty and, if
so, to identify the faulty systems. This situation may deviate
from classic diagnosis problems in that we just have limited
information (e.g. sensor or control information) of system
internals. Moreover, we may have incomplete knowledge
about the system design. This makes manual system
modeling hard or even impossible. The problem is then to
compare the limited information of the working systems
(perhaps only input-output information) to identify faulty
systems.</p>
      <p>In this work we tackle one concrete problem of this kind.
It is motivated by photovoltaics. We describe it in more
detail below. The problem that arises in our and other
applications is that not every two systems can be compared. We
thus need to learn relations between systems.</p>
      <p>There are different approaches to learn structure, e.g.
learning Bayesian networks, Markov random fields, or
similar concepts. The concept that fits our needs is correlation
networks. A correlation network is some structure in the
correlation matrix, e.g. a minimum spanning tree or a
clustering. In our application we have n variables which
represent the produced energy per photovoltaic system. Given
that a single system correlates strongly with enough other
systems, we use this information for FDI via applying a
median.</p>
      <p>[Figure 1: (a) Fitness matrix for systems 1–6. (b) Digraph.]</p>
      <p>
        We can also think of correlation networks as a method
for knowledge discovery. They have been applied in areas such
as biology [18; 10] and finance [
        <xref ref-type="bibr" rid="ref16">12</xref>
        ] to analyze gene
coexpression and financial markets. In our situation, the first
step is to learn linear relations between systems. For
learning we need historical data. A sample result of this step is
depicted in Fig. 1. In Fig. 1(a) the fitness matrix, our robust
alternative to the correlation matrix, is shown. It represents
the degree of linearity between any two systems. For FDI,
the second step of our method, we work with the result as
depicted in Fig. 1(b) and current data. In the example, we derive
for each of the six systems an estimate m̂i of its current
value yi from its neighbors' current values, e.g. for system
1 we get an estimate from the current values of the systems
2, 3, and 4, and for system 5 from system 6. Finally, we test for a
fault by checking if |m̂i − yi| is large.
      </p>
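      <p>The median test sketched above can be illustrated as follows (a minimal sketch with hypothetical values; the function and variable names are ours, not part of the method's notation):</p>
      <preformat><![CDATA[
```python
from statistics import median

def is_faulty(y_i, neighbor_estimates, s):
    """Flag system i as faulty if its current value deviates from the
    median of the neighbor-derived estimates by more than s."""
    m_hat = median(neighbor_estimates)
    return abs(m_hat - y_i) > s

# Hypothetical example: system 1 is estimated from systems 2, 3, 4.
estimates = [10.2, 9.8, 10.0]              # neighbor-derived estimates of y_1
print(is_faulty(10.1, estimates, s=1.0))   # small deviation -> False
print(is_faulty(6.0, estimates, s=1.0))    # large deviation -> True
```
]]></preformat>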
      <p>
        The major difficulty we try to tackle with this approach
is the presence of many faults. Faults influence both the
learning problem and the FDI problem. Robustness is thus an
essential property of our algorithms. Our result can be seen
as a robust structure learning algorithm for the purpose of
FDI. Robustness is a desirable property of many learning
and estimation algorithms. However, the underlying
optimization problems, unlike their non-robust variants, are often
NP-hard. This is for example the case for computing robust
versus non-robust estimators for linear regression, e.g. Least
Median of Squares versus Ordinary Least Squares [
        <xref ref-type="bibr" rid="ref20">16</xref>
        ]. We
avoid NP-hardness by careful modeling of our problem.
In particular, our algorithms are computationally efficient.
Under some conditions, FDI can be done in (almost) linear
time in the number of systems n.
      </p>
      <p>To summarize our contributions, we introduce a novel
alternative to the sample correlation matrix and present a first
use of it to discover structure appropriate for FDI in general
and for identifying faulty photovoltaic (PV) systems in
particular. Our method works in the presence of many faults. Our
algorithms are computationally efficient. Our method
incorporates several techniques from machine learning and
statistics: (repeated) Theil-Sen estimation for robust
simple linear regression, trimming to obtain a robust fitness
measure, randomized subset selection for improved
running time, and a median mechanism to conduct FDI.</p>
      <p>In Sec. 2 we discuss our method. In Sec. 3 we present
experimental and theoretical results.</p>
      <sec id="sec-1-1">
        <title>Motivating Application: Identifying Faulty Photovoltaic Systems</title>
        <p>Faults influence the performance of photovoltaic (PV)
systems: PV systems produce less energy than possible if faults
occur. We can distinguish between two kinds of faults:
faults caused by exogenous events such as shading,
(melting) snow, and tree leaves covering solar modules, and faults
caused by endogenous events such as module defects and
degradation, defects at the power inverter, and string
disconnections.</p>
        <p>We are going to detect faults by estimating the drop in
produced energy. Most of the common faults result in such
a drop. The particular problem is given by the sensor setup:
we only assume knowledge of the produced energy and possibly,
but not necessarily, the area (e.g. the zip code) where the PV
system is located.</p>
        <p>We apply our method to PV system data. Difficulties in
the application are the different system types and deployments,
for example, different numbers of strings and
modules per string and differing orientations (north, west,
south, east) of the modules; see Fig. 2. Moreover, information
may be lacking due to missing sensors, and data may be
incomplete due to unreliable data connections. Faults occur
frequently, in particular exogenous faults during winter.</p>
        <p>The novelty of our work in the context of photovoltaics
is that it works in an extremely restrictive (only power
measurement) sensor setting. To the best of our knowledge, we
are the first to consider this restrictive sensor setting. We
only need to know the produced energy of a PV system.
There is also the implicit assumption, which is tested by
the learning algorithm, that the systems are not too far from
each other so that we can observe them in similar working
(environmental) conditions. Distances of a couple of
kilometers are possible. Systems which are very close to each
other and have the same orientation such as systems in a
solar power plant yield the best results. Other approaches
assume the presence of a plane-of-array irradiance sensor,
which is mostly deployed for solar power plants.
Irradiance estimations via satellite imaging are usually not
accurate enough.</p>
      </sec>
      <sec id="sec-1-3">
        <title>Related Work</title>
        <p>Correlation networks have applications in biology and
finance. See e.g. [12; 18; 10] and the references therein. In
biology [18; 10], they are applied to study gene interactions.
The correlation matrix is the basis for clustering genes and
the identification of biologically significant clusters. In [18;
10], a scale-free network is derived via the concept of
topological overlap. Scale-free networks tend to have few nodes
(genes) with many neighbors, so called hubs.</p>
        <p>Correlation networks are primarily used for knowledge
discovery. In particular, concepts such as clusters, hubs, and
spanning trees are interpreted in the context of biology and
finance. In our work, we introduce a robust alternative to
correlation networks.</p>
        <p>
          Other structural approaches, i.e. approaches based on
graphical models, use Bayesian networks, Markov
random fields, and similar concepts. Gaussian Markov
random fields are loosely related to correlation networks. Their
structure is described by the precision matrix, the inverse
covariance matrix (ch. 17.3 of [
          <xref ref-type="bibr" rid="ref13">9</xref>
          ]).
        </p>
        <p>Another structural approach is FDI in sensor networks [7;
4; 19; 20]. Current approaches [7; 4; 19] mainly deal with
wireless sensor networks. The algorithms usually use the
median for FDI, as we do. The difference is that FDI
in wireless sensor networks uses a geometric model similar
to interpolation methods. It requires the geographic location
of the sensors. It is assumed that two sensors close to each
other have a similar value. This cannot be assumed in
general. To overcome these problems of manual modeling, we
apply machine learning techniques.</p>
        <p>
          Models for PV systems are compared in [
          <xref ref-type="bibr" rid="ref18">14</xref>
          ]. All these
models require the plane-of-array irradiance. Fault
detection of PV systems is the topic of e.g. [3; 8; 5; 2;
17]. Firth et al. [
          <xref ref-type="bibr" rid="ref12">8</xref>
          ] consider faults if the PV system
generates no energy. Another type of fault occurs if the
panels are covered by snow, tree leaves, or something else.
In this case, we can observe a drop in energy. It is
considered e.g. in [5]. The fraction of panel area covered
is a crucial parameter. All these approaches [3; 8; 5; 2;
17] require at least knowledge of the plane-of-array
irradiance, i.e. they require an installed irradiance sensor. We do
not make this assumption.
        </p>
        <p>
          The median is common in fault detection and
identification. One reason for this is its optimal
breakdown point [
          <xref ref-type="bibr" rid="ref20">16</xref>
          ]. We also make use of (repeated)
Theil-Sen algorithms [6; 15] for learning. Ingredients of our
fault identification algorithm are the algorithm for median
selection [1] and an algorithm for generating uniform random
subsets efficiently (see e.g. the Fisher-Yates or Knuth shuffle
in [
          <xref ref-type="bibr" rid="ref17">13</xref>
          ]). In our algorithm analysis we derive bounds for a
partial Cauchy-Vandermonde identity (pg. 20 in [
          <xref ref-type="bibr" rid="ref15">11</xref>
          ]).
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <sec id="sec-2-1">
        <title>Data Model for Incomplete Data</title>
        <p>We have data from n systems and one data stream per
system. A data stream for system i ∈ {1, . . . , n} is given by a
set Ni ⊆ {1, . . . , N } of available data and values xi,t ∈ R
with t ∈ Ni. We can think of the parameter t as discrete
time. With Ni, we explicitly model data availability.
Incomplete data is a common problem in our situation. Causes in
practice are unreliable data connections or unreliable
sensors. We call D := {(xi,t)t∈Ni : i ∈ {1, . . . , n}} a data
set. Sets of historical and current data are the inputs to our
algorithms.</p>
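        <p>For concreteness, the data model can be sketched as follows (a minimal illustration with hypothetical values; the dictionary layout and names are our own, chosen so that the key set of each stream plays the role of Ni):</p>
        <preformat><![CDATA[
```python
# One stream per system, stored as {t: x_{i,t}}; the key set is N_i,
# i.e. the time indices for which data is actually available.
streams = {
    1: {0: 5.1, 1: 5.3, 3: 5.0},           # system 1: t = 2 is missing
    2: {0: 2.4, 1: 2.6, 2: 2.5, 3: 2.4},   # system 2: complete
}

def common_times(D, i, j):
    """N_i ∩ N_j: time indices where both systems have data."""
    return sorted(set(D[i]) & set(D[j]))

print(common_times(streams, 1, 2))  # -> [0, 1, 3]
```
]]></preformat>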
      </sec>
      <sec id="sec-2-2">
        <title>Fitness Matrix – Definition and Robustness</title>
        <p>
          The fitness matrix is intended as a robust replacement for
the sample correlation matrix. The sample correlation
coefficient, like the sample covariance, is well known to be
sensitive to faults (outliers) [
          <xref ref-type="bibr" rid="ref20">16</xref>
          ]. As an example, we
generated the data for Fig. 1 with faults. The non-robust sample
correlation matrix would have yielded a digraph without edges
instead of the digraph in Fig. 1(b).
        </p>
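        <p>This sensitivity can be illustrated with hypothetical data (the helper below is our own textbook implementation of the sample correlation coefficient; a single corrupted point destroys an otherwise perfect correlation):</p>
        <preformat><![CDATA[
```python
import statistics

def pearson(xs, ys):
    """Sample correlation coefficient (textbook formula)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
print(round(pearson(xs, ys), 2))  # -> 1.0 (perfect linear relation)
ys[0] = 1000.0                    # a single fault (outlier)
print(pearson(xs, ys) < 0.5)      # the correlation collapses -> True
```
]]></preformat>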
        <p>A fault can be an arbitrary corruption of a single data item
xi,t. That is, xi,t = x̃i,t + Δ, Δ ≠ 0, where Δ is the fault.
We think of x̃i,t as the actual or true but unobserved value.</p>
        <p>We do not make any assumptions on the faults themselves
but only on their number. This is at the core of the definition of
the breakdown point. This statistical concept is defined for
a particular estimation or learning problem; in our case,
simple linear regression.</p>
        <p>
          Linear regression is closely related to the correlation
coefficient. For simple linear regression – a regression model
with one independent and one dependent variable – the
correlation coefficient can be seen as a fitness measure of
the line which fits the data best w.r.t. vertical squared
distances. See e.g. [
          <xref ref-type="bibr" rid="ref20">16</xref>
          ]. However, the corresponding estimator,
namely ℓ2-regression a.k.a. ordinary least squares, is known
to be sensitive to outliers [
          <xref ref-type="bibr" rid="ref20">16</xref>
          ]. On the other hand, there are
estimators for simple linear regression which are robust to
a large number of faults, i.e. they have a large breakdown
point.
        </p>
        <p>
          The idea underlying the fitness matrix is thus to replace
the correlation coefficient (and ℓ2-regression) by a robust
notion of fitness based on robust linear regression. In the
remainder of this section we recall the definition of the
breakdown point following [
          <xref ref-type="bibr" rid="ref20">16</xref>
          ], pg. 9, and formalize the notion of fitness matrices.
        </p>
        <p>We define the breakdown point only for simple linear
regression. We fix two systems i, j ∈ {1, . . . , n} and define
Z := Zi,j := {(xi,t, xj,t) : t ∈ Ni ∩ Nj}. Let T be a
regression estimator, i.e. T(Zi,j) = θ̂ ∈ R² is the intercept
and slope for the data set Zi,j. For Z, we define Z′ as Z
with m data points arbitrarily corrupted. Define</p>
        <p>bias(m; T, Z) := sup_{Z′} ‖T(Z) − T(Z′)‖.</p>
        <p>If bias(m; T, Z) is infinite, then m faults (outliers) have an
arbitrarily large effect on the estimate T(Z′). The (finite
sample) breakdown point of T and Z is defined as</p>
        <p>ε∗(T, Z) := min { m/|Z| : bias(m; T, Z) = ∞ }.</p>
        <p>To explain this notion, we consider four typical examples.
The breakdown point ε∗(Tℓ2, Z) is 1/n for ℓ2-regression.
This holds for any Z. The situation is different for
ℓ1-regression in that ε∗(Tℓ1, Z) = 1/n for some Z.</p>
        <p>In this work we are going to use the Theil-Sen estimator¹
T^TS, a.k.a. median slope selection. The reasons are its
breakdown point of at least 1 − 1/√2 ≥ 0.292 (see e.g. [6]) and the
wide availability of efficient implementations of near-linear
time algorithms. There is also a variant of T^TS, called the
repeated Theil-Sen estimator, which has a breakdown point
of 0.5, but less efficient implementations. The concrete
definition of T^TS can be found e.g. in [6]. It is however not
important for our application; only its robustness property
and the existence of efficient implementations are.</p>
        <p>¹There is a subtle issue here we have to deal with.
Regression problems are optimization problems. The solution to the
concrete optimization problem need not be unique. In our
situation, intercept and slope are unique for ℓ2-regression but not
for ℓ1-regression. The estimator Tℓ1 is however unique for a
(deterministic) algorithm solving the optimization problem. We thus
think of Tℓ1 as the output of a particular (deterministic) algorithm.</p>
        <p>To define the breakdown point of a fitness matrix, let f
be a real-valued function defined on any finite data set. We
define the fitness matrix as</p>
        <p>Fji := f(Zi,j)</p>
        <p>and its breakdown point as</p>
        <p>ε∗(F) := min_{i,j} ε∗(f, Zi,j).</p>
        <p>Next, we provide the fitness matrix we are going to use. It
has the property that Fji is close to zero if xi and xj are
strongly linearly related, and it has a high breakdown point.</p>
        <p>Let yt := xi,t and ŷt := xj,t · θ̂2 + θ̂1, t ∈ Ni ∩ Nj, for
the Theil-Sen estimate θ̂ of Zi,j. Let rt := ŷt − yt be the
residuals, and let i1, . . . , ik ∈ Ni ∩ Nj with k := |Ni ∩ Nj|
be such that |ri1| ≤ · · · ≤ |rik|. We define</p>
        <p>f^TS(Zi,j) := ( Σ_{t=1}^{⌊k/√2⌋} |r_{i_t}| ) / ( Σ_{t=1}^{⌊k/√2⌋} |y_{i_t}| ).   (1)</p>
        <p>We define F^TS w.r.t. f^TS, i.e. F^TS_ji := f^TS(Zi,j).</p>
        <p>Note that the sums go from 1 up to ⌊k/√2⌋. This
trimming together with the high breakdown point of Theil-Sen
directly implies the following result.</p>
        <p>Theorem 1. It holds that ε∗(F^TS) ≥ 1 − 1/√2.</p>
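        <p>Eq. 1 can be sketched as follows (with a naive quadratic Theil-Sen fit for the slope and intercept; the function names, the (x, y) pair convention, and the toy data are ours):</p>
        <preformat><![CDATA[
```python
from math import floor, sqrt
from statistics import median

def theil_sen(points):
    """Naive Theil-Sen fit for pairs (x, y): slope and intercept."""
    slopes = [(y2 - y1) / (x2 - x1)
              for i, (x1, y1) in enumerate(points)
              for (x2, y2) in points[i + 1:] if x2 != x1]
    b = median(slopes)
    a = median(y - b * x for x, y in points)
    return a, b

def fitness(pairs):
    """Trimmed fitness f^TS of Eq. 1 for pairs (x_t, y_t), where y is
    the dependent variable. Small values mean a strong robust linear
    relation: sum the smallest floor(k/sqrt(2)) absolute residuals and
    normalize by the sum of |y| over the same trimmed index set."""
    a, b = theil_sen(pairs)
    order = sorted(range(len(pairs)),
                   key=lambda t: abs(b * pairs[t][0] + a - pairs[t][1]))
    h = floor(len(pairs) / sqrt(2))
    kept = order[:h]
    num = sum(abs(b * pairs[t][0] + a - pairs[t][1]) for t in kept)
    den = sum(abs(pairs[t][1]) for t in kept)
    return num / den

# Strong linear relation with one corrupted data point -> fitness near 0.
Z = [(x, 2.0 * x + 1.0) for x in range(1, 11)]
Z[4] = (5, 50.0)  # fault
print(fitness(Z) < 0.05)  # -> True (the fault is trimmed away)
```
]]></preformat>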
        <p>Finally, we compare the sample correlation matrix and the
fitness matrix. Let C denote the sample correlation matrix
and define C′ji = 1 − |Cji|. Both matrices have the property
that if some entry is close to 0 then xi and xj have a strong
linear relation. It is guaranteed that C′ji is at most 1. A value
close to 1 means a weak linear relation. For F^TS_ji, it is not
guaranteed that F^TS_ji ≤ 1, but experimental results suggest
that this is usually the case. We also note that both matrices
obey a weak form of the triangle inequality, since if xi and
xj correlate strongly and xj and xk correlate strongly, then
xi and xk also correlate.</p>
        <p>There are two important benefits of fitness matrices over
correlation matrices: they are robust and they are also defined
for incomplete data. On the negative side, the fitness matrix
is not positive semi-definite; in particular, it is not symmetric.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Structure in Fitness Matrices – Algorithms LEARN and IDENTIFY</title>
        <p>
We want to identify faulty systems. In a first step, we learn
a structure appropriate for FDI; see algorithm LEARN. We
obtain it via thresholding the fitness matrix. Most of the
correlation networks, i.e. structures arising from the sample
correlation matrix, are obtained in this way [12; 18; 10].
We denote the threshold by θ ≥ 0 and the threshold fitness
matrix as Fji;θ := Fji if Fji ≤ θ, and Fji;θ := 0 if Fji &gt; θ.</p>
        <p>Algorithm 1 Algorithm LEARN with input D (data set) and
parameter θ (fitness threshold). Output is a digraph G with
edge labels (intercept, slope) representing the threshold
fitness matrix.</p>
        <p>Let G = (V, E) be a digraph with V = {1, . . . , n} and E = {}.
for all i ∈ V and j ∈ V \ {i} do
    Learn (Theil-Sen) the intercept aj,i and slope bj,i
    between xi (dependent variable) and xj (independent variable).
end for
for all i ∈ V and j ∈ V \ {i} do
    Compute the trimmed fitness f = f^TS (Eq. 1) of Zi,j.
    if f ≤ θ then
        Add to G the directed edge from j to i with edge labels (aj,i, bj,i).
    end if
end for</p>
        <p>The input to algorithm LEARN is a data set D as described
in Sec. 2.1. It outputs a digraph G = (V, E), i.e. the
(possibly sparse) threshold fitness matrix F^TS_θ. Additionally,
intercept and slope of the simple linear regressions are added
as edge labels.</p>
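        <p>LEARN can be sketched compactly as follows (a naive quadratic Theil-Sen fit, our own data-structure choices, and hypothetical toy streams; the near-linear algorithms of the text would replace the inner fit):</p>
        <preformat><![CDATA[
```python
from math import floor, sqrt
from statistics import median

def theil_sen(points):
    slopes = [(y2 - y1) / (x2 - x1)
              for i, (x1, y1) in enumerate(points)
              for (x2, y2) in points[i + 1:] if x2 != x1]
    b = median(slopes)
    a = median(y - b * x for x, y in points)
    return a, b

def trimmed_fitness(points, a, b):
    """f^TS of Eq. 1 given a fitted line y = b*x + a."""
    r = sorted((abs(b * x + a - y), abs(y)) for x, y in points)
    h = floor(len(points) / sqrt(2))
    return sum(rt for rt, _ in r[:h]) / sum(ay for _, ay in r[:h])

def learn(D, theta):
    """LEARN sketch: add edge j -> i with label (intercept, slope)
    whenever the trimmed fitness of regressing x_i on x_j is <= theta."""
    G = {i: {} for i in D}
    for i in D:
        for j in D:
            if i == j:
                continue
            ts = sorted(set(D[i]) & set(D[j]))    # N_i ∩ N_j
            Z = [(D[j][t], D[i][t]) for t in ts]  # (independent, dependent)
            if len(Z) < 3:
                continue                           # too little common data
            a, b = theil_sen(Z)
            if trimmed_fitness(Z, a, b) <= theta:
                G[i][j] = (a, b)                   # edge j -> i
    return G

# Two hypothetical streams related by x_1 = 2 x_2 + 0.5.
D = {1: {t: 2.0 * t + 0.5 for t in range(12)},
     2: {t: float(t) for t in range(12)}}
G = learn(D, theta=0.1)
print(2 in G[1])  # -> True: system 2 is a neighbor of system 1
```
]]></preformat>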
        <p>Algorithm 2 Algorithm IDENTIFY with input G (digraph
with edge labels), current data yi for the i-th system, and
parameters k and s (deviation). It outputs the set of all faulty
systems H.</p>
        <p>Set H = {}.
for all i ∈ V = {1, . . . , n} do
    Let Ni− := {j ∈ V : (j, i) ∈ E}.
    if |Ni−| = 0 then
        Continue with the next (system) i.
    end if
    if |Ni−| ≥ k then
        Select uniformly at random a k-element subset S from Ni−.
    else
        Set S := Ni−.
    end if
    Compute Mi := {ŷj = bj,i · yj + aj,i : j ∈ S}.
    Compute the median m̂i of Mi.
    Add i to H if |m̂i − yi| &gt; s.
end for
Output H.</p>
        <p>In the second step, we identify the faulty systems; see
algorithm IDENTIFY. Its input is the result of algorithm
LEARN. Algorithm IDENTIFY constructs a random
digraph of in-degree at most k for FDI. It works as follows.
Independently for every system, we choose uniformly at
random at most k of its neighbors in the digraph G and compute
the median m̂i of the estimated values derived from the selected
neighbors' values. We compare the median m̂i to the current
system value and decide whether it has a fault or not via the
deviation parameter s.</p>
        <p>We discuss the threshold parameter θ and the deviation
parameter s in Sec. 3.1. They essentially depend on the
variance in the data set D. Parameter k in algorithm IDENTIFY
has the purpose of improving running time efficiency. In
particular, we have the following result.</p>
        <p>Theorem 2. Let D be a data set with n systems and let
m := maxi |Ni|. The running time of LEARN is O(n² · m ·
log(m)). The running time of IDENTIFY is O(k · n).</p>
        <p>Proof. LEARN. There are O(n²) pairs of systems. The
Theil-Sen estimator can be computed in time O(m log(m))
[6]. The computation of f^TS, Eq. 1, is done via sorting and
thus takes time O(m log(m)).</p>
        <p>
          IDENTIFY. Assume |Ni−| ≥ k. We choose a k-element
subset out of Ni− uniformly at random and compute
the median. For random selection we can use for example
the Fisher-Yates (or Knuth) shuffle [
          <xref ref-type="bibr" rid="ref17">13</xref>
          ] which runs in time
O(k), and for median selection the algorithm in [1] which
also runs in time O(k). The second case, 1 ≤ |Ni−| ≤ k − 1,
is analogous. This shows that the overall running time of
IDENTIFY is O(kn).
        </p>
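        <p>For illustration, the selection step can be sketched with a randomized quickselect, which runs in expected linear time (the algorithm in [1] is deterministic; the names here are ours):</p>
        <preformat><![CDATA[
```python
import random

def quickselect(a, idx, rng=random):
    """Return the idx-th smallest element of a (0-based) in expected
    O(len(a)) time via random pivoting and three-way partitioning."""
    a = list(a)
    while True:
        pivot = rng.choice(a)
        lo = [x for x in a if x < pivot]
        hi = [x for x in a if x > pivot]
        n_eq = len(a) - len(lo) - len(hi)
        if idx < len(lo):
            a = lo                     # answer lies among smaller elements
        elif idx < len(lo) + n_eq:
            return pivot               # answer equals the pivot
        else:
            idx -= len(lo) + n_eq
            a = hi                     # answer lies among larger elements

vals = [9.8, 10.2, 10.0, 50.0, 9.9]
print(quickselect(vals, len(vals) // 2))  # -> 10.0 (the median)
```
]]></preformat>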
        <p>In Sec. 3.2, we provide sufficient conditions under which
IDENTIFY works correctly even if k = O(log(n)). This
is a strong running time improvement from O(n²) to O(n ·
log(n)).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <sec id="sec-3-1">
        <title>Experimental</title>
        <p>In this section we discuss how to apply our
method, Sec. 2, to photovoltaic data. In particular, it
remains to discuss how the use-case fits the model, more
precisely, why there is a strong correlation between PV
systems. Finally, we present experimental results to verify the
estimation and fault identification quality of our algorithms.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Use-Case Photovoltaics</title>
        <p>A simple system model for PV systems is as follows:</p>
        <p>Pi = ci · Ii.</p>
        <p>Here, Pi is the power, Ii the plane-of-array (POA) irradiance
of the i-th system, and ci a constant of the system which can
be interpreted as the efficiency of converting solar energy
into electrical energy. More complex physical models
include system variables such as the module temperature [14;
17]. Our considerations translate to the more complex
models as long as they are time-independent. We also note that
these models are more accurate, but only slightly, since the
POA-irradiance has the most critical influence on the
produced energy.</p>
        <p>We get from the above considerations that Pi = c′ij · Pj
given that Ii = Ij. In our situation we cannot test the
condition Ii = Ij since we do not know the POA-irradiance, but
Ii ≈ Ij holds if the systems operate under similar weather
conditions and have a similar orientation. The former holds
if the systems are close to each other. To reduce the effect of
different orientations, see Fig. 2, we consider the following
model: PiΔ = uij · PjΔ + vij. The variable PiΔ is the power
within a time interval Δ, usually one hour. The variables
uij and vij are the unknowns.</p>
        <p>In more general words, let Yi be the output of the i-th
system and let Xi describe the system input and system
internals. Our model assumption is that for a reasonable
number of system pairs (i, j), the system outputs Yi and Yj are
linearly related given that Xi ≈ Xj. By the above
considerations, it is plausible that PV systems fulfill these
requirements.</p>
        <p>We next describe our experimental setup to verify it by
real data.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Experimental Setup</title>
        <p>To demonstrate our method, we use two data sets DA and
DK. DA arises from 13 systems from a solar park located in
Arizona2. The PV systems there are geographically close.
We use data for one year. DK arises from 40 systems spread
across a typical municipality located in Austria, i.e. the
systems can be up to some kilometers apart. Their orientation
can differ significantly. Some systems may be orientated to
the west, others to the east. We have data for almost a year.</p>
        <p>A system is faulty if it produces considerably less energy
than estimated; see Fig. 3. This definition is motivated by
the fact that most faults imply a drop in energy. The
difficulty in setting up an experiment is that we do not know
in advance if a PV system is faulty, i.e. we do not have
labeled data. We thus design our experiment as follows: We
verify the accuracy of the energy estimation, namely the
relative deviation |m̂i − yi|/|m̂i| for every system i and over
the period of a week, with m̂i and yi as in algorithm IDENTIFY.</p>
        <p>This relative deviation is noted in column Hour of Table
1 for the time period 12:00 to 13:00. In column Day of
Table 1 we note the same but for a whole day, i.e. m̂i is the
estimated energy (power) for the whole day calculated from
the hourly estimates and yi the actual energy for the whole
day. For the whole day we consider the time period from
9:00 to 16:00.</p>
        <p>The number |m̂i − yi|/|m̂i| can be read as a relative
deviation, i.e. the estimation is 100 · x% away from the true
value, where x is some entry in the columns Hour and Day.</p>
        <p>[Figure 3: POA-irradiance, estimated, and real produced energy.]</p>
        <p>The entry x is an average over all systems and 7 days. The
first day is noted in column Start.</p>
        <p>Algorithm LEARN is executed once for every week,
with θ = 0.8 and roughly three months of historical data,
e.g. for the months January, February, and March to get
estimates for the days April, 1. to April, 7. Algorithm
IDENTIFY is executed with s = 0.25 · |m̂i| and k = 11 for both
data sets. The choice of parameters θ and s depends on the
variance of the input data; they were chosen manually so as to
get a reasonable number of good estimates, and similarly for k.
The difficulty in choosing the parameters is that decreasing θ
will usually reduce the number of neighbors. For a
reasonable number of good estimates we need both a strong
linear relation of a system to its neighbors and enough
neighbors. The parameters were chosen accordingly. For
parameter k, we derive a theoretical result in Sec. 3.2 which says
that k = O(log(n)) is a good choice, for n the number of
systems.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Experimental Results</title>
        <p>The false positive rate (FPR), the false negative rate (FNR),
and the estimation accuracy are the most interesting
numbers for us. As remarked above, we do not have labeled
data. The faults as recorded in Table 1 are faults as detected
by our algorithm.</p>
        <p>We make a worst case assumption, namely that all
detected faults are false positives. This yields an FPR of
0% to 5% per 7-day period (rows in the table). To get an
understanding of the FNR, we simulated faults by subtracting
33% of the energy. The FNR in this case is at most 10% per
7-day period. In the rows Sum and Sum−33% in Table 1 we
summed up the faults to get the FPR and FNR for the whole
data sets.</p>
        <p>[Table 1: average relative deviation per 7-day period (columns Start, Hour, Day); (a) results for DA with start dates April, 1. to Dec., 1.; (b) results for DK with start dates June, 1. to Sept., 15.; each panel with rows Sum and Sum−33%.]</p>
        <sec id="sec-3-4-2">
          <p>The interpretation of these results is as follows. Setting
the parameter s to 0.25 · |m̂i| means that we define a fault
as a 25% relative deviation of the observed produced energy
from its true value. Setting s to this value yields the above
mentioned FPR. Simulating a 33% drop in energy, which
corresponds naturally to the faults we want to detect, yields
the above FNR.</p>
          <p>For the data set DA we have knowledge about the
POA-irradiance. We can thus cross-check with the irradiance
whether faulty systems were identified correctly; see Fig.
3. This manual inspection suggests that the FPR is much
smaller than 5%, close to 1%. Furthermore, increasing the
drop implies a decreasing FNR, i.e. stronger energy drops
are easier to identify.</p>
          <p>Depending on the application, these rates may or may not be considered appropriate. In some applications, we may want to detect faults which yield a drop in energy of less than 25%. This worsens the FPR and FNR. On the other hand, if we want to improve the FPR and FNR, we may have to specify a fault as a drop in energy of 50%. In other words, our parameter setting is one out of many reasonable parameter settings.</p>
          <p>We argued in Sec. 3.1 that algorithm LEARN yields good estimates for the system's current value. For an estimate to be good, the neighboring system j of system i in G needs to work correctly. Moreover, the regression estimates, the intercept and slope, need to be accurate enough. In this section, we provide a supporting theoretical result which says that, if enough estimates are good, algorithm IDENTIFY correctly identifies all faulty systems.</p>
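          <p>The identification step can be sketched in a few lines. The following is a minimal Python sketch, not the authors' implementation; in particular, representing the edge labels of G as coef[(j, i)] = (intercept, slope) pairs produced by LEARN is our assumption.</p>

```python
import random
import statistics

def identify(y, in_neighbors, coef, s, k, seed=0):
    """Flag system i as faulty when its value y[i] deviates by more than s
    from the median of regression-based estimates computed from a random
    k-subset of its in-neighbors (a sketch of algorithm IDENTIFY)."""
    rng = random.Random(seed)
    faulty = []
    for i, yi in enumerate(y):
        nbrs = in_neighbors[i]
        if not nbrs:
            continue  # no redundancy for this system, no decision possible
        S = rng.sample(nbrs, min(k, len(nbrs)))  # random k-element subset
        # estimate y_i from each sampled neighbor j via the edge label (a, b)
        estimates = [coef[(j, i)][0] + coef[(j, i)][1] * y[j] for j in S]
        m_hat = statistics.median(estimates)
        if abs(m_hat - yi) > s:
            faulty.append(i)
    return faulty
```

          <p>The median makes a single iteration robust: as long as more than half of the sampled estimates are s-good, the decision for system i is correct.</p>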
          <p>The input to IDENTIFY is a digraph G = (V, E) with edge labels. Let yi be the current value of system i and write yi = ỹi + Δi, where we think of ỹi as the true value. We say that system i is correct if Δi = 0 and faulty otherwise.</p>
          <p>The input to IDENTIFY has to satisfy two conditions, Eq. 2 and 3, for the algorithm to work correctly. These conditions state that there are more good than bad estimates. We formulate them below.</p>
          <p>Theorem 3. Let 0 &lt; p &lt; 1 and s &gt; 0. Let H := {i ∈
{1, . . . , n} : |Δi| &gt; 2s}. Assume that the input digraph G
satisfies Eq. 2 and 3. Then, algorithm IDENTIFY outputs
H with probability at least 1 − p.</p>
          <p>Let ŷj be the estimates as computed in IDENTIFY. Fix a system i and let j ∈ Ni−. We say that ŷj is s-good for system i if |ỹi − ŷj| ≤ s. Let Ai := {j ∈ Ni− : |ỹi − ŷj| ≤ s} be the s-good estimates for system i. Condition (2) is as follows: For every system i with 1 ≤ |Ni−| ≤ k − 1 it holds that
|Ai| &gt; |Ni−| / 2, (2)
i.e. there are more good than bad estimates. For the case that |Ni−| ≥ k we assume
|Ai| &gt; (1 − 1/cn,p,k) · |Ni−|, (3)
with cn,p,k := ((n/p) · 18^k)^(2/(k−1)).</p>
          <p>Setting k = Ω(log(n/p)) makes cn,p,k a constant independent of n and p. This is the most reasonable setting, as it implies that a constant fraction of estimates can be bad while IDENTIFY still identifies the faulty systems correctly. We remark that the asymptotic analysis which yields cn,p,k is not optimal. In particular, it seems that the factor 18^k is not optimal and may be improved to a factor as small as 2^(k/2). For practical applications, the following heuristic seems reasonable: For n systems and a failure probability p of IDENTIFY, set k to 10 · log(n/p).</p>
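          <p>As a small illustration, the constant cn,p,k and the heuristic choice of k can be computed directly. This is our sketch; the formulas are taken from condition (3) and the heuristic above, and we assume the natural logarithm.</p>

```python
import math

def c_npk(n, p, k):
    """The constant c_{n,p,k} = ((n/p) * 18**k) ** (2/(k-1)) from condition (3)."""
    return ((n / p) * 18.0 ** k) ** (2.0 / (k - 1))

def heuristic_k(n, p):
    """The paper's rule of thumb k = 10 * log(n/p), rounded up
    (natural logarithm assumed)."""
    return math.ceil(10 * math.log(n / p))

# For k on the order of log(n/p), c_{n,p,k} settles near a constant
# (roughly 18**2), so a constant fraction 1/c of the estimates may be bad.
```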
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>Proof of Theorem 3</title>
        <p>We apply the following lemma with A = Ai and M = Ni−. It directly gives us the probability that IDENTIFY correctly identifies the faulty systems, since the median works correctly if |S ∩ Ai| &gt; |S ∩ (Ni− \ Ai)|, where S is the (random) set chosen in IDENTIFY.</p>
        <p>Lemma 1. Let M be a finite set and A ⊆ M. Let k ≥ 2 be an integer. Let S ⊆ M be a k-element subset selected uniformly at random. Then
Pr_S(|S ∩ A| &gt; |S ∩ (M \ A)|) ≥ 1 − 18^k · (|M \ A| / |M|)^⌊k/2⌋.</p>
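        <p>The lemma can be checked empirically by sampling random k-subsets. The sketch below is ours; note that the bound is only non-trivial when |M \ A|/|M| is small compared to 1/18².</p>

```python
import random

def majority_prob(m, r, k, trials=5000, seed=1):
    """Estimate Pr(|S ∩ A| > |S ∩ (M \\ A)|) for a uniformly random
    k-subset S of an m-element set M, where |M \\ A| = r."""
    rng = random.Random(seed)
    M = list(range(m))  # elements 0..r-1 form the bad set F = M \ A
    hits = 0
    for _ in range(trials):
        S = rng.sample(M, k)
        bad = sum(1 for x in S if r > x)
        if k - bad > bad:  # good elements form a strict majority of S
            hits += 1
    return hits / trials

def lemma_bound(m, r, k):
    # Lemma 1's lower bound, with exponent floor(k/2)
    return 1 - 18 ** k * (r / m) ** (k // 2)
```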
        <p>Proof. Let M := {1, . . . , m}, F := M \ A, and r := |F|. First, we are going to bound the number of k-element subsets S ⊆ M for which |S ∩ A| ≤ k0 with k0 := ⌊k/2⌋. The exact number of these sets is
Σ_{i=0..k0} C(m − r, i) · C(r, k − i), (4)
where C(a, b) denotes the binomial coefficient, since there are C(|A|, i) = C(m − r, i) ways to choose an i-element subset from A and C(r, k − i) ways to choose the remaining k − i elements from F.</p>
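        <p>The counting formula in Eq. 4 is easy to validate against brute-force enumeration for small parameters. This is our sketch; Python's math.comb returns 0 when the lower index exceeds the upper one, which matches the convention used here.</p>

```python
from itertools import combinations
from math import comb

def bad_subset_count(m, r, k):
    """Eq. 4: number of k-subsets S of an m-set with |S ∩ A| ≤ floor(k/2),
    where |A| = m - r and the remaining r elements form F = M \\ A."""
    k0 = k // 2
    return sum(comb(m - r, i) * comb(r, k - i) for i in range(k0 + 1))

def brute_force(m, r, k):
    A = set(range(m - r))  # the good elements; the last r elements are F
    k0 = k // 2
    return sum(1 for S in combinations(range(m), k)
               if k0 >= len(A.intersection(S)))
```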
        <p>
          Note that |S ∩ A| &gt; |S ∩ F| iff |S ∩ A| &gt; ⌊|S|/2⌋ = k0. Moreover, we can assume that r = |F| ≥ 1 since the claim holds for r = 0. To provide a lower bound for the probability of this event, we show an upper bound on the complementary event, i.e. |S ∩ A| ≤ k0. First, we derive an upper bound for Eq. 4 using
(m/k)^k ≤ C(m, k) ≤ (me/k)^k (5)
for e = 2.718… and 1 ≤ k ≤ m. (See e.g. pg. 12 in [
          <xref ref-type="bibr" rid="ref15">11</xref>
          ].) Since this inequality holds just for k ≥ 1, we rewrite Eq. 4 as
C(r, k) + Σ_{i=1..k0} C(m − r, i) · C(r, k − i). (6)
        </p>
        <p>It holds that C(r, k) ≤ (re/k)^k and, for the second term in Eq. 6,
Σ_{i=1..k0} C(m − r, i) · C(r, k − i) ≤ Σ_{i=1..k0} ((m − r)e/i)^i · (re/(k − i))^(k−i).</p>
        <p>Next, we prove the upper bound on the probability p that |S ∩ A| ≤ k0. We select uniformly at random a k-element subset of M; each such subset has probability C(m, k)^(−1). We multiply Eq. 6 by C(m, k)^(−1) and get two parts with p ≤ p1 + p2. For the first part, p1 ≤ (re/m)^k since C(m, k)^(−1) ≤ (k/m)^k. For the second part p2, we use m − r ≤ m, ((k − i)/i)^i ≤ 2^k, and (k/(k − i))^k ≤ 2^k, the latter since i ≤ k0. We get an upper bound for the second part:
p2 ≤ (12r/m)^k · Σ_{i=1..k0} (m/r)^i.</p>
        <p>An upper bound for the geometric sum is k0 · (m/r)^k0. In total,
p ≤ p1 + p2 ≤ (re/m)^k + k0 · (12r/m)^k · (m/r)^k0.
Substituting (k − 1)/2 for k0 and further simplification yields
p ≤ ((k + 2)/2) · 12^k · (r/m)^((k−1)/2) ≤ 18^k · (r/m)^((k−1)/2),
the latter since ((k + 2)/2)^(1/k) ≤ 1.5 for k ≥ 3. We thus have the claimed lower bound 1 − p on the probability.</p>
        <p>Proof of Theorem 3. We show that the success probability of IDENTIFY is at least 1 − p. Let p0 := p/n. We show that for every i ∈ V, G = (V, E), the success probability of a single iteration in the loop of IDENTIFY is at least 1 − p0. This implies the above claim since (1 − p0)^n ≥ 1 − p0 · n by e.g. the Binomial Theorem.</p>
        <p>Fix some i ∈ V, i.e. we consider one iteration in the loop of IDENTIFY. We apply Lemma 1. Let us assume that |S ∩ Ai| &gt; |S ∩ (Ni− \ Ai)|, with S the random k-element subset as in IDENTIFY and Ai the good estimates as defined above. Since |S ∩ Ai| &gt; |S ∩ (Ni− \ Ai)|, it follows that |m̂i − ỹi| ≤ s for the median m̂i as computed in IDENTIFY, where yi = ỹi + Δi.</p>
        <p>Assume Δi = 0, i.e. system i works correctly. Then |m̂i − yi| = |m̂i − ỹi| ≤ s. Thus, i is not output.</p>
        <p>Assume Δi ≠ 0, i.e. system i is faulty. Here, |m̂i − yi| = |m̂i − ỹi − Δi|. It follows from |m̂i − ỹi| ≤ s and |Δi| &gt; 2s that |m̂i − yi| &gt; s. Thus, i is output.</p>
        <p>Finally, we want the probability of failure for a single iteration to be at most p/n. By Lemma 1, 18^k · α^⌊k/2⌋ ≤ 18^k · α^((k−1)/2) ≤ p/n with α := |Ni− \ Ai| / |Ni−|. With c = cn,p,k := ((n/p) · 18^k)^(2/(k−1)), the latter inequality holds since c · |Ni− \ Ai| ≤ |Ni−| and thus (1 − 1/c) · |Ni−| ≤ |Ai|.</p>
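        <p>The closing computation can be checked numerically: whenever the fraction α of bad estimates is at most 1/cn,p,k, the per-iteration failure bound from Lemma 1 stays below p/n. This is a sketch with our own function names.</p>

```python
def c_npk(n, p, k):
    # c_{n,p,k} = ((n/p) * 18**k) ** (2/(k-1)), as in condition (3)
    return ((n / p) * 18.0 ** k) ** (2.0 / (k - 1))

def step_ok(n, p, k, alpha):
    """Final step of the proof: the per-iteration failure probability
    18^k * alpha^((k-1)/2) is at most p/n."""
    return p / n >= 18.0 ** k * alpha ** ((k - 1) / 2.0)
```

        <p>At α = 1/cn,p,k the two sides coincide exactly, which is how the constant was derived; doubling α already violates the bound.</p>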
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Open Problems</title>
      <p>We presented a method for learning structure to identify faulty systems. The basic method of correlation networks has found many applications in biology and finance. In our application, the presence of many faults required the design and analysis of robust algorithms. We provided an experimental analysis of our algorithms to verify their estimation and fault identification quality. We also provided a supporting theoretical result which allowed us to considerably improve the running time of algorithm IDENTIFY.</p>
      <p>Improving the running time of LEARN remains an open problem. It is not immediately clear that it is necessary to compare every two systems: if systems (i, j) and (j, k) correlate strongly, then (i, k) also correlate, but not necessarily strongly. Thus, it may not be necessary to solve a simple linear regression problem for every pair of systems.</p>
      <p>
        In other applications it may be useful to solve a general linear regression problem instead of a simple linear regression, e.g. if our model depends on more than one variable per system. The corresponding correlation networks are based on the partial correlation coefficient [
        <xref ref-type="bibr" rid="ref16">12</xref>
        ]. Since robust estimators for general linear regression are based on regression problems which are NP-hard, it remains an open problem to find a robust alternative to partial correlation networks that can be computed efficiently.
      </p>
      <p>Finally, to put our method and results into a broader context, we approached the problem of FDI via learning graphical models. It seems to be a challenge to learn classical component-models of technical systems to conduct diagnosis. In this work, we were able to close the gap between (structure) learning on the one side and FDI on the other side for a concrete problem setting.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Traxler</surname>
          </string-name>
          .
          <article-title>Fault detection of large amounts of photovoltaic systems</article-title>
          .
          <source>In Online Proc. of the ECML PKDD 2013 Workshop on Data Analytics for Renewable Energy Integration (DARE'13)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Bin</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Steve</given-names>
            <surname>Horvath</surname>
          </string-name>
          .
          <article-title>A general framework for weighted gene co-expression network analysis</article-title>
          .
          <source>Statistical Applications in Genetics and Molecular Biology</source>
          ,
          <volume>4</volume>
          (
          <issue>17</issue>
          ),
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Chongming</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Jiuchun Ren,
          <string-name>
            <given-names>Chuanshan</given-names>
            <surname>Gao</surname>
          </string-name>
          , Zhonglin Yan, and
          <string-name>
            <surname>Li</surname>
            <given-names>Li</given-names>
          </string-name>
          .
          <article-title>Sensor fault detection in wireless sensor networks</article-title>
          .
          <source>In Proc. of the IET International Communication Conference on Wireless Mobile and Computing</source>
          , pages
          <fpage>66</fpage>
          -
          <lpage>69</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Yang</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Meratnia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Havinga</surname>
          </string-name>
          .
          <article-title>Outlier detection techniques for wireless sensor networks: a survey</article-title>
          .
          <source>Communications Surveys and Tutorials</source>
          , IEEE,
          <volume>12</volume>
          (
          <issue>2</issue>
          ):
          <fpage>159</fpage>
          -
          <lpage>170</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Manuel</given-names>
            <surname>Blum</surname>
          </string-name>
          , Robert W. Floyd, Vaughan Pratt, Ronald L. Rivest, and Robert E. Tarjan.
          <article-title>Linear time bounds for median computations</article-title>
          .
          <source>In Proc. of the 4th Annual ACM Symposium on Theory of Computing</source>
          , pages
          <fpage>119</fpage>
          -
          <lpage>124</lpage>
          ,
          <year>1972</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>In 37th IEEE International Conference on Acoustics, Speech and Signal Processing</source>
          , pages
          <fpage>1681</fpage>
          -
          <lpage>1684</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>K. H.</given-names>
            <surname>Chao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Ho</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Modeling and fault diagnosis of a photovoltaic system</article-title>
          .
          <source>Electric Power Systems Research</source>
          ,
          <volume>78</volume>
          (
          <issue>1</issue>
          ):
          <fpage>97</fpage>
          -
          <lpage>105</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Jinran</given-names>
            <surname>Chen</surname>
          </string-name>
          , Shubha Kher, and
          <string-name>
            <given-names>Arun</given-names>
            <surname>Somani</surname>
          </string-name>
          .
          <article-title>Distributed fault detection of wireless sensor networks</article-title>
          .
          <source>In Proc. of the 2006 Workshop on Dependability Issues in Wireless Ad Hoc Networks and Sensor Networks</source>
          , pages
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>Energy Conversion and Management</source>
          ,
          <volume>51</volume>
          :
          <fpage>1929</fpage>
          -
          <lpage>1937</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>An optimal-time algorithm for slope selection</article-title>
          .
          <source>SIAM Journal on Computing</source>
          ,
          <volume>18</volume>
          (
          <issue>4</issue>
          ):
          <fpage>792</fpage>
          -
          <lpage>810</lpage>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Ding</surname>
          </string-name>
          , Dechang Chen, Kai Xing, and Xiuzhen Cheng.
          <article-title>Localized fault-tolerant event boundary detection in sensor networks</article-title>
          .
          <source>In Proc. of the 24th Annual Joint Conference of the IEEE Computer and Communications Societies</source>
          , volume
          <volume>2</volume>
          , pages
          <fpage>902</fpage>
          -
          <lpage>913</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.K.</given-names>
            <surname>Firth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.J.</given-names>
            <surname>Lomas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.J.</given-names>
            <surname>Rees</surname>
          </string-name>
          .
          <article-title>A simple model of PV system performance and its use in fault detection</article-title>
          .
          <source>Solar Energy</source>
          ,
          <volume>84</volume>
          :
          <fpage>624</fpage>
          -
          <lpage>635</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Hastie</surname>
          </string-name>
          , Robert Tibshirani, and
          <string-name>
            <given-names>Jerome</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <article-title>The elements of statistical learning</article-title>
          . Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Steve</given-names>
            <surname>Horvath</surname>
          </string-name>
          .
          <article-title>Weighted network analysis: applications in genomics and systems biology</article-title>
          . Springer Science &amp; Business Media
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jukna</surname>
          </string-name>
          .
          <article-title>Extremal combinatorics: with applications in computer science</article-title>
          .
          <source>Springer, 2nd edition</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Dror Y.</given-names>
            <surname>Kenett</surname>
          </string-name>
          , Michele Tumminello, Asaf Madi, Gitit Gur-Gershgoren,
          <string-name>
            <given-names>Rosario N.</given-names>
            <surname>Mantegna</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Eshel</given-names>
            <surname>Ben-Jacob</surname>
          </string-name>
          .
          <article-title>Dominating clasp of the financial sector revealed by partial correlation analysis of the stock market</article-title>
          .
          <source>PLoS ONE</source>
          ,
          <volume>5</volume>
          (
          <issue>12</issue>
          ):
          e15032,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Donald E.</given-names>
            <surname>Knuth</surname>
          </string-name>
          .
          <article-title>The art of computer programming: seminumerical algorithms</article-title>
          , volume
          <volume>2</volume>
          .
          Addison-Wesley Longman Publishing Co., Inc.,
          <source>3rd edition</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Marion</surname>
          </string-name>
          .
          <article-title>Comparison of predictive models for PV module performance</article-title>
          .
          <source>In 33rd IEEE Photovoltaic Specialist Conference</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Matousek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Mount</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. S.</given-names>
            <surname>Netanyahu</surname>
          </string-name>
          .
          <article-title>Efficient randomized algorithms for the repeated median line estimator</article-title>
          .
          <source>Algorithmica</source>
          ,
          <volume>20</volume>
          (
          <issue>2</issue>
          ):
          <fpage>136</fpage>
          -
          <lpage>150</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Peter J.</given-names>
            <surname>Rousseeuw</surname>
          </string-name>
          and
          <string-name>
            <given-names>Annick M.</given-names>
            <surname>Leroy</surname>
          </string-name>
          .
          <article-title>Robust regression and outlier detection</article-title>
          , volume
          <volume>589</volume>
          . John Wiley &amp; Sons,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>