An Unsupervised Framework for Semantics Driven Causal Explanations for Anomalies⋆

Bhanukiran Vinzamuri, Elham Khabiri, and Anuradha Bhamidipaty

IBM T.J. Watson Research Center, Yorktown Heights, USA
Bhanu.Vinzamuri@ibm.com, {khabiri,anubham}@us.ibm.com

⋆ Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Explainability for anomaly detection from multivariate time series sensor data procured from large assets in Industry 4.0 is a challenging and relatively unexplored problem. Apart from the temporal nature of the time series data itself, another challenging aspect of this problem is the need to make the explainer context-aware by infusing semantics, which may originate from a different data modality. To address this problem, we present a workflow for a first-of-its-kind semantics-driven causal explainer for time series which uses Bayesian network structure learning techniques and tailors them to the anomaly explanation problem by simultaneously leveraging knowledge of asset semantics from a graph model. We also present explanatory insights obtained from investigating a mechanical vibration anomaly in a steam turbine, which were validated in our engagement with a large European energy company.

1 Introduction

The need for explainable AI has become more evident with the rise of deep learning techniques, which are often accurate in practice but are opaque black boxes that do not offer interpretable insights. Site Reliability Engineers (SREs) monitoring large assets are often interested in obtaining causal explanations, which identify cause-effect relationships and are robust against spurious correlations. Causal explanations can be made more meaningful in practice if they are obtained over semantically relevant features which can be validated by the SRE more effectively. In this paper, we present an approach to obtain causal local explanations for anomalies observed in large power plant assets (such as steam turbines) using semantics in conjunction with Bayesian network structure learning to capture causation rather than correlation. One of the key novelties of our approach is that it obtains local explanations by learning from multivariate time series sensor data in an unsupervised fashion. This approach does not require a black-box output to be explained a priori, which distinguishes it from traditional black-box explainability; however, the proposed method can be modified to explain black-box outputs if needed. Semantic graph models play a vital role in our framework: they encode the structural model of the asset, describing the relationships among its components. Each physical component maps to a set of sensors, work orders, and notifications. The semantically relevant sensors for a potential anomaly (a pool of candidate global explanations) are leveraged by our causal explainer to identify the relevant sensors to explain an anomaly.

2 Background

We now provide background on how to estimate and compare causal graphs from noisy, non-linear time series data. Most Bayesian network-based graph learning methods suffer from computational complexity issues as the number of nodes increases, which makes scalable heuristic-based techniques all the more important. We therefore develop a causal Bayesian network technique which uses scalable greedy hill-climbing together with mutual-information-based non-linear directed information testers to estimate directionality (cause-effect). The resulting graphs can then be compared using metrics such as the Hamming distance between their adjacency matrices.
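A minimal sketch of this procedure in Python is given below; it is illustrative rather than our production implementation. In particular, a simple histogram-based mutual information estimate stands in for the non-linear directed information testers (plain mutual information is symmetric, so edge directionality in this toy score arises only from the acyclicity-constrained greedy search), and the bin count and sparsity penalty are assumed hyperparameters.

```python
# Illustrative sketch: greedy hill-climbing over DAGs with a mutual-information
# score, plus Hamming distance between adjacency matrices. The MI estimate and
# the 0.2 edge penalty are assumptions, not values from our implementation.
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram-based MI estimate between two 1-D series (stand-in for the
    non-linear directed information testers)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    joint = joint / joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz])).sum())

def is_acyclic(adj):
    """Check acyclicity by iteratively removing sink nodes."""
    adj, active = adj.copy(), list(range(adj.shape[0]))
    while active:
        sinks = [i for i in active if adj[i, active].sum() == 0]
        if not sinks:
            return False  # no sink among active nodes => cycle
        active = [i for i in active if i not in sinks]
    return True

def score(adj, data, penalty=0.2):
    """Sum of parent-child MI over all edges, minus a sparsity penalty."""
    total = sum(mutual_information(data[:, i], data[:, j])
                for i, j in zip(*np.nonzero(adj)))
    return total - penalty * adj.sum()

def hill_climb(data, max_iter=100):
    """Greedy hill-climbing: toggle one edge at a time, keep improving,
    acyclic candidates only."""
    d = data.shape[1]
    adj = np.zeros((d, d), dtype=int)
    best = score(adj, data)
    for _ in range(max_iter):
        improved = False
        for i in range(d):
            for j in range(d):
                if i == j:
                    continue
                cand = adj.copy()
                cand[i, j] = 1 - cand[i, j]  # toggle edge i -> j
                if not is_acyclic(cand):
                    continue
                s = score(cand, data)
                if s > best:
                    adj, best, improved = cand, s, True
        if not improved:
            break
    return adj

def hamming(adj_a, adj_b):
    """Structural Hamming distance between two adjacency matrices."""
    return int(np.abs(adj_a - adj_b).sum())

# Example: learn graphs over two synthetic windows and compare structure.
rng = np.random.default_rng(0)
w1 = rng.normal(size=(200, 4))
w1[:, 1] += 0.8 * w1[:, 0]  # induce a dependency in window 1 only
w2 = rng.normal(size=(200, 4))
print("structural change:", hamming(hill_climb(w1), hill_climb(w2)))
```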
3 Proposed Method

We now describe the steps involved in our approach.

– The SRE identifies a time window of interest, based on visual inspection of multivariate sensor data, to investigate an anomaly for which an explanation is sought.
– The semantic graph model is queried to identify a global explanation consisting of candidate sensors which correspond to the component failure.
– Multiple causal Bayesian network graphs are inferred successively, using the method described above, over uniformly sampled periodic time intervals spanning the entire window of interest.
– The graphs encoding the causal mechanism over each interval are compared successively, as explained above, to identify the time of onset of the failure, namely the time point causing the maximum structural change between consecutive graphs. The graphs preceding and succeeding this point of onset explain the anomaly.

4 Results

We applied the approach outlined above to explain a critical mechanical vibration anomaly in a steam turbine. The mechanical vibration node from the semantic graph model provided the candidate list of 40 sensors from which the causal explainer identifies local explanations. The SRE investigated a 60-day window preceding the date of the critical failure. Our approach was able to identify (a) the point of onset of the failure (approximately 40 days before the critical failure), and (b) journal bearing 4 as the sensor with the highest Hamming distance before and after onset, which was validated by the SRE after inspecting the post-failure repair logs. Our approach is agnostic to the kind of anomaly being explained and can be used to explain other kinds of critical failures as well. The scalability of our approach can be further improved by pre-computing the causal graphs in the backend over the entire time horizon, allowing the SRE to investigate multiple windows efficiently; a sketch of this windowed onset-detection step is given below.
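The sketch below illustrates the windowed onset-detection and sensor-ranking step, reusing the hill_climb() and hamming() functions from the Background sketch. The window length, step size, and per-node ranking heuristic (counting changed edges incident to each node) are illustrative assumptions, not values or details from our deployment.

```python
# Illustrative sketch of windowed onset detection and per-sensor ranking.
# Assumes hill_climb() and hamming() from the Background sketch are in scope;
# window and step sizes are assumed, not values from our deployment.
import numpy as np

def onset_and_ranking(data, window=500, step=500):
    """Learn one causal graph per window, locate the window boundary with the
    largest structural change (onset), and rank sensors by the number of
    changed edges touching each node across that boundary."""
    graphs = [hill_climb(data[s:s + window])
              for s in range(0, data.shape[0] - window + 1, step)]
    changes = [hamming(a, b) for a, b in zip(graphs, graphs[1:])]
    onset = int(np.argmax(changes))             # boundary of max change
    diff = np.abs(graphs[onset] - graphs[onset + 1])
    per_node = diff.sum(axis=0) + diff.sum(axis=1)  # in- plus out-edge changes
    ranking = np.argsort(per_node)[::-1]        # sensors, most changed first
    return onset, ranking, per_node

# Hypothetical usage over pre-computed sensor data (sensor_data: T x d array):
# onset, ranking, scores = onset_and_ranking(sensor_data)
# print("onset at boundary", onset, "top-ranked sensor index:", ranking[0])
```

Pre-computing the per-window graphs once, as noted above, means repeated SRE investigations over different windows reduce to cheap Hamming-distance comparisons over stored adjacency matrices.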