<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Unsupervised Framework for Semantics-Driven Causal Explanations for Anomalies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bhanukiran Vinzamuri</string-name>
          <email>Bhanu.Vinzamuri@ibm.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elham Khabiri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anuradha Bhamidipaty</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IBM T.J. Watson Research Center</institution>
          ,
          <addr-line>Yorktown Heights</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Explainability for anomaly detection from multivariate time series sensor data procured from large assets in Industry 4.0 is a challenging and relatively unexplored problem. Apart from the temporal nature of time series data itself, another challenging aspect of this problem is the necessity of making the explainer context-aware by infusing semantics which may originate from a different data modality. To address this problem, we present a workflow for a first-of-its-kind semantics-driven causal explainer for time series which uses Bayesian network structure learning techniques and tailors them to the anomaly explanation problem by simultaneously leveraging knowledge of asset semantics from a graph model. We also present explanatory insights obtained from investigating a mechanical vibration anomaly in a steam turbine, which were validated in our engagement with a large European energy company.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>The need for explainable AI has become more evident with the rise of deep
learning techniques, which are often accurate in practice but are opaque
black boxes that do not offer interpretable insights. Site Reliability
Engineers (SREs) monitoring large assets are often interested in causal
explanations, which identify cause-effect relationships and are known to
be robust against spurious correlations. Causal explanations become more
meaningful in practice when they are obtained over semantically relevant
features that the SRE can validate more effectively. In this paper, we
present an approach for obtaining causal local explanations for anomalies observed
in large power plant assets (such as steam turbines), using semantics in
conjunction with Bayesian network structure learning to capture causation rather
than correlation. One of the key novelties of our approach is that it obtains
local explanations by learning from multivariate time series sensor data in an
unsupervised fashion. The approach does not require a black-box output to
be explained a priori, which distinguishes it from traditional black-box explainability.
However, the proposed method can be modified to explain black-box outputs if
needed. Semantic graph models play a vital role in our framework: they encode
the structural model of the asset, describing the relationships among its different
components. Each physical component maps to a set of sensors, work orders,
and notifications. The semantically relevant sensors for a potential anomaly
(a pool of candidate global explanations) are leveraged by our causal explainer
to identify the relevant sensors that explain the anomaly.</p>
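The component-to-sensor mapping can be sketched with a plain dictionary standing in for the semantic graph model; the schema, component name, and sensor names below are hypothetical illustrations, not the actual asset model:

```python
# Hypothetical slice of the asset's semantic graph model: each physical
# component maps to its sensors, work orders, and notifications.
SEMANTIC_GRAPH = {
    "mechanical_vibration": {
        "sensors": ["journal_bearing_1_vib", "journal_bearing_4_vib", "shaft_disp_x"],
        "work_orders": ["WO-1042"],
        "notifications": ["NT-301"],
    },
}

def candidate_sensors(graph, component):
    """Query the semantic model for the sensors attached to a component --
    the pool of candidate global explanations for an anomaly there."""
    return sorted(graph.get(component, {}).get("sensors", []))

print(candidate_sensors(SEMANTIC_GRAPH, "mechanical_vibration"))
```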
    </sec>
    <sec id="sec-2">
      <title>2 Background</title>
      <p>We now provide some background on how to estimate and compare
causal graphs from noisy, non-linear time series data. Most Bayesian-network-based
graph learning methods suffer from computational complexity issues as
the number of nodes increases, which makes it important to use scalable
heuristic-based techniques. We therefore develop a causal Bayesian network
technique which uses scalable greedy hill-climbing and mutual-information-based
non-linear directed information testers to estimate directionality (cause-effect).
Subsequently, the resulting graphs can be compared using metrics such as the
Hamming distance, which compares their adjacency matrices.</p>
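The comparison step can be sketched as follows, assuming each learned graph is given as a binary adjacency matrix (the structure-learning step itself is elided); the per-node variant gives a sensor-level view of structural change:

```python
def hamming_distance(adj_a, adj_b):
    """Number of edge differences between two causal graphs given as
    binary adjacency matrices of the same shape."""
    return sum(
        a != b
        for row_a, row_b in zip(adj_a, adj_b)
        for a, b in zip(row_a, row_b)
    )

def node_hamming(adj_a, adj_b, i):
    """Edge changes touching node i (its row and column): how much the
    causal mechanism around that one node shifted between the graphs."""
    row = sum(a != b for a, b in zip(adj_a[i], adj_b[i]))
    col = sum(adj_a[j][i] != adj_b[j][i] for j in range(len(adj_a)) if j != i)
    return row + col

# Two 3-node graphs: edge 0->1 disappears, edge 2->0 appears.
g1 = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
g2 = [[0, 0, 0], [0, 0, 1], [1, 0, 0]]
print(hamming_distance(g1, g2))  # 2
```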
    </sec>
    <sec id="sec-3">
      <title>3 Proposed Method</title>
      <p>We now describe the steps involved in our approach.</p>
      <list list-type="simple">
        <list-item><p>The SRE identifies a time window of interest, based on visual inspection of the multivariate sensor data, in which to investigate an anomaly and look for an explanation.</p></list-item>
        <list-item><p>The semantic graph model is queried to identify a global explanation consisting of the candidate sensors that correspond to the component failure.</p></list-item>
        <list-item><p>Multiple causal Bayesian network graphs are inferred successively, using the method described above, over uniform periodically sampled time intervals across the entire window of interest.</p></list-item>
        <list-item><p>The graphs encoding the causal mechanism over each window are compared successively, as explained above, to identify the time of onset of the failure: the time point causing the maximum structural change between graphs. The graphs preceding and succeeding this point of onset explain the anomaly.</p></list-item>
      </list>
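The steps above can be sketched as a small driver, with `learn_graph` standing in for the Bayesian network structure learner (greedy hill-climbing plus directed-information tests) and `hamming` for the graph comparison metric; both names are placeholders, not the actual implementation:

```python
def explain_anomaly(windows, learn_graph, hamming):
    """Learn one causal graph per sampled window, compare successive
    graphs, and flag the boundary with the largest structural change
    as the onset of the failure."""
    graphs = [learn_graph(w) for w in windows]
    changes = [hamming(g1, g2) for g1, g2 in zip(graphs, graphs[1:])]
    onset = max(range(len(changes)), key=changes.__getitem__)
    # The graphs straddling the onset boundary explain the anomaly.
    return onset, graphs[onset], graphs[onset + 1]
```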
    </sec>
    <sec id="sec-4">
      <title>4 Results</title>
      <p>We applied the approach outlined above to explain a critical mechanical
vibration anomaly in a steam turbine. The mechanical vibration node of the
semantic graph model provided the candidate list of 40 sensors from which the
causal explainer identifies local explanations. The SRE investigated a 60-day
window preceding the date of the critical failure. Our approach was able to identify (a)
the point of onset of the failure (approximately 40 days preceding the critical
failure), and (b) journal bearing 4 as the sensor with the highest Hamming
distance before and after onset, which the SRE validated after inspecting
the post-failure repair logs. Our approach is agnostic to the kind of anomaly
being explained and can be used to explain other kinds of critical failures
as well. Its scalability can be further improved by pre-computing the
causal graphs in the backend over the entire time horizon, allowing the SRE to
investigate multiple windows efficiently.</p>
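The per-sensor ranking behind finding (b) can be sketched as follows; the adjacency matrices and sensor names are illustrative, not data from the engagement:

```python
def rank_sensors(adj_before, adj_after, sensor_names):
    """Rank candidate sensors by how much the causal structure around
    each one changed across the onset (edge differences in that
    sensor's row and column); the top-ranked sensor is the local
    explanation."""
    n = len(sensor_names)
    scores = []
    for i, name in enumerate(sensor_names):
        row = sum(a != b for a, b in zip(adj_before[i], adj_after[i]))
        col = sum(adj_before[j][i] != adj_after[j][i] for j in range(n) if j != i)
        scores.append((row + col, name))
    return sorted(scores, reverse=True)
```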
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>