Real-time Anomaly Detection for Multivariate Data
Streams
Kenneth Odoh


Abstract
We present a real-time multivariate anomaly detection algorithm for data streams based on the Probabilistic Exponentially Weighted Moving Average (PEWMA). Our formulation is resilient to (abrupt transient, abrupt distributional, and gradual distributional) shifts in the data. This novel anomaly detection routine utilizes an incremental online algorithm to handle streams. Furthermore, our proposed anomaly detection algorithm works in an unsupervised manner, eliminating the need for labelled examples. Our algorithm performs well and is resilient to concept drift.

Keywords
signal processing, online learning



1. Introduction

Anomaly detection is the task of classifying patterns that depict abnormal behaviour. Outliers can arise from (human or equipment) errors, faulty systems, and other causes. Anomaly detection is well suited to unbalanced data scenarios, where the goal is to predict minority-class behaviour. Applications include loan default detection, fraud detection, and network intrusion detection. An anomaly detection algorithm can work in unsupervised, supervised, or hybrid modes.

There are different types of anomalies, described as follows.

    • Point Anomaly: the algorithm identifies a single instance as an anomaly with respect to the entire data set.
    • Contextual Anomaly: a data instance is anomalous based on its context (attributes and position in the stream) and the proximity of the chosen anomaly. This anomaly type is ideal for multivariate data; e.g. in the snapshot reading of a machine, an attribute of a single data point may seem abnormal but can be normal behaviour when the entire data is considered.
    • Collective Anomaly: the algorithm decides that a set of data points is anomalous as a group, even though individually these data points exhibit normal behaviour.

Anomaly detection algorithms can operate in the following settings:

    • Static: These algorithms work with static data. Every item is loaded into memory at the same time in order to perform computations.
    • Online: These algorithms work on real-time data streams. Items are incrementally loaded into memory and processed in chunks.
    • Static + Online: The model operates in two stages: initial parameters are estimated in the static setting, then updated incrementally as more data arrives. Our work is of this type.

PEWMA was introduced in the work [1] for online anomaly detection on univariate time series. Drawing inspiration from their work, we provide extensions to support real-time anomaly detection of a multivariate data stream.¹

CIKM’22: Applied Machine Learning Methods for Time Series Forecasting Workshop, October 21, 2022, Atlanta, GA
kenneth.odoh@gmail.com (K. Odoh)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

¹ Source code: https://github.com/kenluck2001/anomalyMulti, Blog: https://kenluck2001.github.io/blog_post/real-time_anomaly_detection_for_multivariate_data_stream.html
2. Background

An anomaly detection algorithm can identify outliers under a variety of signal changes in time-dependent data. These signal changes include the abrupt transient shift, the abrupt distributional shift, and the gradual distributional shift [1], labelled as "A", "B", and "C" respectively in Figure 1.

Figure 1: Different types of signal changes: abrupt transient, abrupt distributional shift, and gradual distributional shift [1] (from left to right).

Online algorithms are useful for real-time applications, as they operate incrementally on data streams. These algorithms receive input incrementally and decide based on an updated parameter that captures the current state of the data stream. This philosophy contrasts with offline algorithms, which assume the entire data is available in memory and therefore require the data to fit in memory. An online algorithm should be time and space efficient.

An anomaly detection algorithm can work in modes such as diagnosis and accommodation [2]. Firstly, the diagnosis method finds the outliers in the data and removes them from the data sample to avoid skewing the distribution. This method is applicable when the distribution of expected behaviours is known; the outliers get excluded during the estimation of the parameters of the distribution [2]. Secondly, the accommodation method finds outliers in the data and estimates the statistical model parameters with the outliers retained. The accommodation method is suitable for data streams, as it accounts for concept drift [3].

Muth [4] laid the foundation for exponential smoothing by showing that it provides optimal estimates for random walks observed with noise. Further works provided a statistical framework for exponential smoothing, leading to the development of linear models [5, 6]; unfortunately, this framework does not extend to nonlinear models. Yule postulated a formulation of time series as a realization of a stochastic process [7]. The Exponentially Weighted Moving Average (EWMA) is ideal for keeping running moments in a data stream. EWMA is a smoothing technique that adds a forgetting parameter to modify the influence of the most recent item in the data stream, as shown in Equation 1. This smoothing causes volatility under abrupt transient changes and is unfit for distributional shifts.

        μ_{t+1} = α μ_t + (1 − α) X_t        (1)

EWMA's limitations motivated the development of the Probabilistic Exponentially Weighted Moving Average (PEWMA). PEWMA [1] improves on EWMA by adding a parameter that includes the probability of the data under the model, as shown in Equation 2. PEWMA handles every shift: abrupt transient, abrupt distributional, and gradual distributional. Both PEWMA and EWMA use a damped window [8].

        μ_{t+1} = α(1 − βP_t) μ_t + (1 − α(1 − βP_t)) X_t        (2)

3. Method

Our formulation provides an implementation of the online covariance matrix in Subsection 3.2, alongside an online inverse covariance matrix based on the Sherman–Morrison formula in Subsection 3.3, and PEWMA in Subsection 3.1.

Our implementation of the online covariance matrix builds on the work [9]. We simplify the algorithm by ignoring the evolutionary computation details in that paper. Our work adapts evolution, as described in the paper [9], as a transition from one generation to the next; it is equivalent to moving from one state to another. This is analogous to how online algorithms handle dynamic changes as new data enters the stream.

3.1. Probabilistic Exponentially Weighted Moving Average (PEWMA)

The PEWMA [1] algorithm works in accommodation mode. The routine shown in Algorithm 1 allows for concept drift [3], which occurs in data streams, by updating the set of parameters that convey the stream state.

Algorithm 1 Probabilistic Exponentially Weighted Moving Average [1]
Require: X_t, X̂_t, α̂_t, T, t
Ensure: P_t, X̂_{t+1}, α̂_{t+1}
  // incremental Z-score
  Z_t ← (X_t − X̂_t) / α̂_t
  // probability density function
  P_t ← (1/√(2π)) exp(−Z_t²/2)
  if t < T then
    // training phase
    α_t ← 1 − 1/t
  else
    // weight adjusted by the probability of the sample
    α_t ← (1 − βP_t) α
  end if
  // moving averages of the first and second moments
  s1 ← α_t s1 + (1 − α_t) X_t
  s2 ← α_t s2 + (1 − α_t) X_t²
  // incremental mean
  X̂_{t+1} ← s1
  // incremental standard deviation
  α̂_{t+1} ← √(s2 − s1²)

The parameters of the anomaly detection algorithm consist of X_t the current data, X̂_t the mean of the data, α̂_t the current standard deviation, P_t the probability density value, X̂_{t+1} the mean of the next data (incremental aggregate), α̂_{t+1} the next standard deviation (incremental aggregate), T the data size, and t a point in T. Initialize the process by setting the initial data for training the model as s1 = X_1 and s2 = X_1².

Our work used the parameters α, β, and τ as seen in Subsection 3.2. The anomaly threshold is chosen based on the criterion that outliers are ≥ 3 times the standard deviation in normally distributed data.
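To make Algorithm 1 concrete, the following is a minimal Python sketch of the update rule for a univariate stream. The class name, the default values of α, β, and T, and the handling of the very first sample are our illustrative choices, not prescribed by the paper.

import math

class PEWMA:
    """Minimal sketch of Algorithm 1 on a univariate stream."""

    def __init__(self, alpha=0.95, beta=0.5, training_size=30):
        self.alpha = alpha       # base forgetting factor
        self.beta = beta         # probabilistic weighting term
        self.T = training_size   # number of training samples
        self.t = 0               # samples seen so far
        self.s1 = 0.0            # running first moment
        self.s2 = 0.0            # running second moment
        self.mean = 0.0          # X̂_t, incremental mean
        self.std = 1.0           # α̂_t, incremental standard deviation

    def update(self, x):
        """Consume one sample; return its probability density P_t."""
        self.t += 1
        if self.t == 1:
            # initialize s1 = X_1 and s2 = X_1^2
            self.s1, self.s2 = x, x * x
            self.mean = x
            return 0.0
        z = (x - self.mean) / self.std if self.std > 0 else 0.0
        p = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        if self.t < self.T:
            a = 1.0 - 1.0 / self.t               # training phase
        else:
            a = (1.0 - self.beta * p) * self.alpha
        self.s1 = a * self.s1 + (1.0 - a) * x
        self.s2 = a * self.s2 + (1.0 - a) * x * x
        self.mean = self.s1
        self.std = math.sqrt(max(self.s2 - self.s1 ** 2, 0.0))
        return p

A sample can then be flagged when it falls more than three estimated standard deviations from the incremental mean, matching the threshold criterion above.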
3.2. Online Covariance matrix

    1. Estimate the covariance matrix for the initial data, X ∈ R^{n×m}, where n is the number of samples and m is the number of dimensions. The initial covariance matrix C ∈ R^{m×m} is shown in Equation 3.

              C = X^T ∗ X        (3)

    2. Perform a Cholesky factorization on the initial (positive-definite) covariance matrix C, as shown in Equation 4.

              C_t = A_t ∗ A_t^T        (4)

    3. The updated covariance in the presence of new data is equivalent to the weighted average of the past covariance without the updated data and the covariance matrix of the transformed input, as shown in Equation 5.

              C_{t+1} = αC_t + βv_t ∗ v_t^T        (5)

       Where v_t = A_t ∗ z_t and z_t ∈ R^m is understood in our implementation as the current data. α and β are positive scalar values.

    4. Increment the Cholesky factor of the covariance matrix as shown in Equation 6.

              A_{t+1} = √α A_t + (√α / ‖z_t‖²) (√(1 + (β/α)‖z_t‖²) − 1) v_t ∗ z_t^T        (6)

    5. There are difficulties with setting α and β; α + β = 1 is the explicit form of the exponential moving average coefficients. The author chose to set the values of α and β using the data stream statistics, as shown in Equation 7 and sketched below. The parameters are set as α = C_a² and β = 1 − C_a², where C_a = √(1 − C_cov), C_cov = 2/(n² + 6), and n is the size of the original data in the static settings.

              A_{t+1} = C_a A_t + (C_a / ‖z_t‖²) (√(1 + ((1 − C_a²)/C_a²)‖z_t‖²) − 1) v_t ∗ z_t^T        (7)
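The following is a minimal Python sketch of Equations 3–7, assuming NumPy. Treating each incoming sample directly as z_t, the jitter added for positive definiteness, and the batch size n are our illustrative assumptions.

import numpy as np

def init_factor(X):
    """Static phase: Cholesky factor A of the covariance of X (n x m)."""
    C = np.cov(X, rowvar=False)            # m x m covariance estimate
    C += 1e-6 * np.eye(C.shape[0])         # jitter for positive definiteness
    return np.linalg.cholesky(C)

def update_factor(A, z, alpha, beta):
    """Online phase: rank-one update of the Cholesky factor (Equation 6)."""
    v = A @ z                              # v_t = A_t z_t
    z2 = z @ z                             # ||z_t||^2
    scale = (np.sqrt(alpha) / z2) * (np.sqrt(1.0 + (beta / alpha) * z2) - 1.0)
    return np.sqrt(alpha) * A + scale * np.outer(v, z)

# Stream-statistics choice of alpha and beta (Equation 7)
n = 100                                    # size of the static batch
c_cov = 2.0 / (n ** 2 + 6)
c_a = np.sqrt(1.0 - c_cov)
alpha, beta = c_a ** 2, 1.0 - c_a ** 2

rng = np.random.default_rng(0)
A = init_factor(rng.normal(size=(n, 5)))   # 5-dimensional stream
for _ in range(10):                        # consume a few new samples
    A = update_factor(A, rng.normal(size=5), alpha, beta)
C_online = A @ A.T                         # reconstructed covariance C_{t+1}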
3.3. Online Inverse Covariance matrix

    1. Estimate the covariance matrix for the initial data, X ∈ R^{n×m}, as in Subsection 3.2. The initial covariance matrix C ∈ R^{m×m} is shown in Equation 8.

              C = X^T ∗ X        (8)

       Invert the covariance matrix to obtain C^{-1}, as shown in Equation 9.

              C^{-1} = (X^T ∗ X)^{-1}        (9)

    2. Perform a Cholesky factorization on the initial covariance matrix C, as shown in Equation 10.

              C_t = A_t ∗ A_t^T        (10)

    3. Increment the inverse of the covariance matrix, as sketched below:

              C_{t+1}^{-1} = (αC_t + βv_t ∗ v_t^T)^{-1}        (11)

              C_{t+1}^{-1} = α^{-1} (C_t + (β/α) v_t ∗ v_t^T)^{-1}        (12)

       Let us fix v̂_t = (β/α) v_t. The resulting simplification using the Sherman–Morrison formula reduces the expression to

              C_{t+1}^{-1} = (1/α) (C_t^{-1} − (C_t^{-1} v̂_t v_t^T C_t^{-1}) / (1 + v_t^T C_t^{-1} v̂_t))        (13)
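Below is a short Python sketch of the Sherman–Morrison update in Equation 13, assuming NumPy. Maintaining C^{-1} this way costs O(m²) per sample instead of O(m³) for a fresh inversion; the sanity check against a direct inverse and the values of α and β are illustrative.

import numpy as np

def update_inverse(C_inv, v, alpha, beta):
    """Sherman-Morrison update of the inverse covariance (Equation 13)."""
    v_hat = (beta / alpha) * v           # v̂_t = (β/α) v_t
    Cv_hat = C_inv @ v_hat               # C_t^{-1} v̂_t
    vC = v @ C_inv                       # v_t^T C_t^{-1}
    denom = 1.0 + vC @ v_hat             # 1 + v_t^T C_t^{-1} v̂_t
    return (C_inv - np.outer(Cv_hat, vC) / denom) / alpha

# Sanity check against a direct inverse of (αC + β v vᵀ)
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
C = X.T @ X
C_inv = np.linalg.inv(C)
v = rng.normal(size=4)
alpha, beta = 0.97, 0.03                 # illustrative values
direct = np.linalg.inv(alpha * C + beta * np.outer(v, v))
assert np.allclose(update_inverse(C_inv, v, alpha, beta), direct)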
3.4. Online Multivariate Anomaly Detection

The probability density function utilizes ideas from hypothesis testing for deciding on a threshold that sets the confidence level. This threshold is used for determining the acceptance and rejection regions of the Gaussian distribution curve.

    1. Use the covariance matrix, C_{t+1}, and the inverse covariance matrix, C_{t+1}^{-1}.
    2. Increment the mean vector, μ, as new data arrives. It is possible to simplify the covariance matrix C, which captures a number of the system dynamics. Let n represent the current data count before the updated data arrives, x̂ the most recent data, and μ_{t+1} the moving average, as shown in Equation 14.

              μ_{t+1} = (n ∗ μ_t + x̂) / (n + 1)        (14)

    3. Set a threshold to determine the acceptance and rejection regions, as illustrated in the sketch below. Items in the acceptance region are considered normal behaviour, as shown in Equation 15.

              p(x) = (1 / √((2π)^m |C|)) exp(−(1/2)(x − μ)^T C^{-1} (x − μ))        (15)

       Where μ is the mean vector, C is the covariance matrix, |C| is the determinant of C, x ∈ R^m is the data vector, and m is the dimension of x.
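The following is a minimal sketch of the scoring step in Equation 15, assuming NumPy. We evaluate the density in log form to avoid numerical underflow in high dimensions; the log-space threshold is an illustrative stand-in for the confidence level discussed above.

import numpy as np

def gaussian_log_density(x, mu, C_inv, log_det_C):
    """Log of Equation 15 for a data vector x (log form avoids underflow)."""
    m = x.shape[0]
    diff = x - mu
    return -0.5 * (m * np.log(2.0 * np.pi) + log_det_C + diff @ C_inv @ diff)

def is_anomaly(x, mu, C_inv, log_det_C, log_threshold):
    """Flag samples whose density falls in the rejection region."""
    return gaussian_log_density(x, mu, C_inv, log_det_C) < log_threshold

Here mu, C_inv, and log_det_C would be maintained online with Equations 13 and 14 (log_det_C, e.g., via np.linalg.slogdet), and the rejection region corresponds to densities below the chosen threshold.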
4. Experiment

We have provided detailed experiments on the proposed algorithms in different realistic scenarios, while maintaining the statistical framework provided by the work [1] and its theoretical guarantees.

We experimented to determine the usefulness of our algorithm by creating a simulation with 10,000,000 random vectors of dimension 15. Repeated trials show that our algorithm is not sensitive to initialization seeds or matrix dimensions. This requirement was a deciding factor in the choice of the evaluation metric, as shown in Equation 16. We provide more information on the metric in Section 5.

Our experiments check the effect of varying the size of the initial static window versus the update window, as shown in Subsection 4.1 and Subsection 4.2.

4.1. Experiment 1

We evaluated the trade-off between the static window and the update window. The experiment setup is as follows (see the sketch after Subsection 4.2):

    • Split the data into 5 segments. Train on the 1st segment (static), update the covariance on the 2nd (online), compare with the static covariance, and calculate the error.
    • Train on segments 1, 2 (static), update the covariance on the 3rd (online), compare with the static covariance, and calculate the error.
    • Train on segments 1, 2, 3 (static), update the covariance on the 4th (online), compare with the static covariance, and calculate the error.
    • Train on segments 1, 2, 3, 4 (static), update the covariance on the 5th (online), compare with the static covariance, and calculate the error.

Figure 2 contains the experimental results.

Figure 2: Compare the threshold of static vs incremental impact performance of anomaly detection (Version 1).

4.2. Experiment 2

The experiment setup is as follows:

    • Split the data into 5 segments.
    • Train on the 1st segment (static), update the covariance on the remaining segments (2, 3, 4, 5) (online), compare with the static covariance, and calculate the error on segments (2, 3, 4, 5).
    • Train on segments 1, 2 (static), update the covariance on the remaining segments (3, 4, 5) (online), compare with the static covariance, and calculate the error on segments (3, 4, 5).
    • Train on segments 1, 2, 3 (static), update the covariance on the remaining segments (4, 5) (online), compare with the static covariance, and calculate the error on segments (4, 5).
    • Train on segments 1, 2, 3, 4 (static), update the covariance on the remaining segment (5) (online), compare with the static covariance, and calculate the error on segment (5).

Figure 3 contains the experimental results.

Figure 3: Compare the threshold of static vs incremental impact performance of anomaly detection (Version 2).
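The sketch below outlines the evaluation loop shared by both experiments, assuming NumPy: train statically on the first k segments, update the Cholesky factor online on the rest, and compare against a static recomputation. The helper names (update_factor from the sketch in Subsection 3.2, aad from Section 5) and the per-sample update are our reading of the setup, not the exact experimental harness.

import numpy as np

def static_covariance(X):
    """Batch covariance over all rows of X (reference for comparison)."""
    return np.cov(X, rowvar=False)

def run_split(segments, k, update_factor, alpha, beta, aad):
    """Train statically on the first k segments, update online on the rest."""
    train = np.vstack(segments[:k])
    A = np.linalg.cholesky(static_covariance(train)
                           + 1e-6 * np.eye(train.shape[1]))
    seen = [train]
    for seg in segments[k:]:
        for z in seg:                           # one online update per sample
            A = update_factor(A, z, alpha, beta)
        seen.append(seg)
    online = A @ A.T                            # covariance from the factor
    exact = static_covariance(np.vstack(seen))  # static recomputation
    return aad(online.ravel(), exact.ravel())   # flattened-matrix error

# Experiment 1 style: errors for k = 1, 2, 3, 4 static segments
# errors = [run_split(segments, k, update_factor, alpha, beta, aad)
#           for k in range(1, 5)]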
5. Result Analysis

Our random matrices get flattened into vectors and used as input. The length of the flattened vector is applied as a normalization factor to make the loss metric agnostic of the matrix dimension. The loss function used in the evaluation is the Absolute Average Deviation (AAD), because it gives a tighter bound on the error than the mean squared error (MSE) or the mean absolute deviation (MAD), as shown in Equation 16. We take the average of the residuals divided by the ground truth for every sample in our evaluation set. If the residual is close to zero, it contributes almost nothing to the measure. On the contrary, if the residual is high, we want to know the difference from the ground truth.

              AAD = (1/n) Σ_{i=1}^{n} |(Ŷ_i − Y_i) / Y_i|        (16)

Where Ŷ_i is the predicted value, Y_i is the ground truth, and n is the length of the flattened matrix.
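A minimal Python sketch of Equation 16, assuming NumPy and nonzero ground-truth entries:

import numpy as np

def aad(y_pred, y_true):
    """Absolute Average Deviation over flattened matrices (Equation 16)."""
    y_pred, y_true = np.ravel(y_pred), np.ravel(y_true)
    return np.mean(np.abs((y_pred - y_true) / y_true))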
We observe that building the model with more data in the initial (static) phase leads to lower errors than having less data in the initial phase and using more data for updates. The observation matches our intuition: operating online uses smaller storage space, but there is still a performance trade-off compared to batch mode.

The error at the beginning of training is significant in both charts. This insight shows that rather than performing the expensive operation of converting a covariance matrix to be positive definite, it is better to initialize with random matrices that are positive definite. More data helps us achieve convergence as it arrives.

The success of the experiments gives us confidence that our multivariate anomaly detection algorithm has similar characteristics to the univariate case described in the work [1].


6. Conclusion

There is no generic anomaly detection algorithm that works for every task. The underlying assumption in this work is that the features in use capture relevant information about the underlying dynamics of the system. Our proposed implementation is an anomaly detection algorithm for handling multivariate streams, even under challenging shifts. In future work, we will extend support for non-stationary distributions in multivariate data streams.

References

[1] K. M. Carter, W. W. Streilein, Probabilistic reasoning for streaming anomaly detection, in: Proceedings of the Statistical Signal Processing Workshop, 2012, pp. 377–380.
[2] V. Hodge, J. Austin, A survey of outlier detection methodologies, Artificial Intelligence Review 22 (2004) 85–126.
[3] G. Ditzler, R. Polikar, Incremental learning of concept drift from streaming imbalanced data, IEEE Transactions on Knowledge and Data Engineering 25 (2013) 2283–2301.
[4] J. F. Muth, Optimal properties of exponentially weighted forecasts, Journal of the American Statistical Association 55 (1960) 299–306.
[5] G. E. P. Box, G. M. Jenkins, Time Series Analysis: Forecasting and Control, Holden-Day, 1976.
[6] S. A. Roberts, A general class of Holt-Winters type forecasting models, Journal of Management Science 28 (1982) 808–820.
[7] G. U. Yule, On a method of investigating periodicities in disturbed series, with special reference to Wolfer's sunspot numbers, Philosophical Transactions of the Royal Society of London 226 (1927) 267–298.
[8] M. Salehi, L. Rashidi, A survey on anomaly detection in evolving data: with application to forest fire risk prediction, SIGKDD Explorations Newsletter 20 (2018) 13–23.
[9] C. Igel, T. Suttorp, N. Hansen, A computational efficient covariance matrix update and a (1+1)-CMA for evolution strategies, in: Proceedings of the Genetic and Evolutionary Computation Conference, GECCO, Washington, USA, 2006.