                         Quantile-Based Statistical Techniques for Anomaly Detection
                         Iryna Yurchuk and Anna Pylypenko
                         Taras Shevchenko National University of Kyiv, 64/13 Volodymyrska St, Kyiv, 01601, Ukraine

                                         Abstract
                                         Anomaly detection is crucial in identifying significant or erroneous events within diverse data
                                         systems, such as fraudulent transactions in finance, abnormal vitals in healthcare, or security
                                         breaches in cybersecurity. Traditional anomaly detection methods often falter when faced with
                                         real-world data characterized by unknown or non-standard distributions. This study introduces
                                         Metalog Distributions as a flexible and robust approach to anomaly detection, capable of
                                         adapting to a wide range of data distributions without predefined assumptions. Utilizing a
                                         synthetic financial dataset of 100 000 transaction records, the methodology involves fitting
                                         Metalog Distributions through quantile functions and detecting anomalies by analyzing
                                         deviations and residuals from the expected distribution. Empirical results demonstrate the
                                         superior accuracy and robustness of the Metalog-based method in capturing anomalies, with
                                         significant improvements in precision, recall, F1 score, and AUC compared to traditional
                                         techniques. This research underscores the potential of Metalog Distributions in enhancing
                                         anomaly detection across various domains with complex and diverse datasets.

                                         Keywords
                                         Anomaly Detection, Data Analysis, Outlier Handling, Quantile Functions, Data Modeling.

                         1. Introduction
                             Anomalies in data, often referred to as outliers, can indicate critical events or errors within data
                         collection systems. These anomalies might represent rare but significant occurrences such as fraudulent
                         transactions in finance, abnormal patient vitals in healthcare, or potential security breaches in
                         cybersecurity. Accurately detecting and handling these anomalies is paramount, as failing to do so can
                         lead to misinformed decisions and actions.
                             Traditional anomaly detection methods, including statistical approaches and machine learning
                         techniques, often encounter limitations when applied to real-world data. Specifically, these methods are
                         typically designed to identify outliers in datasets that follow specific distributional assumptions, usually
                         normality. While effective in controlled environments with well-behaved data, they often struggle with
                         datasets exhibiting unknown or non-standard distributions. For instance, financial data can exhibit
                         heavy tails and skewness, medical data might be multimodal, and cybersecurity data can be highly
                         irregular and sparse. In these cases, the assumptions underlying traditional statistical methods do not
                         hold, leading to inaccurate detection of anomalies and inefficient handling processes.
                             The challenges of outlier detection are compounded by the increasing complexity, volume, and
                         variety of datasets, leading to difficulties in managing and evaluating these outliers. Traditional
                         statistical methods, while effective for small, well-defined datasets, often struggle with the large and
                         complex datasets commonly encountered in today's data-driven environments [1]. For example, in
                         urban traffic analysis, outlier detection methods must differentiate between flow outliers and trajectory
                         outliers, each requiring distinct analytical approaches [2].
                             Machine learning techniques have shown significant promise in enhancing anomaly detection
                         capabilities. Methods such as clustering, density-based, and deep learning approaches have been widely
                         researched and applied across various domains. H. Wang, M. J. Bah, and M. Hammad provide a


                         Dynamical System Modeling and Stability Investigation (DSMSI-2023), December 19-21, 2023, Kyiv, Ukraine
                         EMAIL: i.a.yurchuk@gmail.com (A. 1), anna.pylypenko@knu.ua (A. 2)
                         ORCID: 0000-0001-8206-3395 (A.1), 0000-0002-6343-4469 (A. 2)
                                      ยฉ๏ธ 2023 Copyright for this paper by its authors.
                                      Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                      CEUR Workshop Proceedings (CEUR-WS.org)


comprehensive survey of these methods and their applications [3]. In the realm of cybersecurity,
machine learning and data mining methods have been extensively reviewed for their effectiveness in
intrusion detection, offering valuable guidance on selecting suitable techniques, as described by
A.L. Buczak and E. Guven [4], D. Palko et al. [5]. Deep learning, in particular, has advanced the state
of the art in anomaly detection, especially in handling complex datasets such as images and text, as
demonstrated by L. Ruff et al. [6, 7]. Image recognition has found wide application in agricultural
robotic systems, for example for fruit retrieval during harvesting and disease detection [8]. Despite these advancements,
there is a pressing need for more flexible and universally applicable approaches to anomaly detection.
Traditional methods often require extensive parameter tuning and rely heavily on prior knowledge of
the data distribution, which is not always feasible in dynamic and diverse real-world applications. This
limitation has led researchers to explore novel methods such as hybrid unsupervised clustering-based
approaches, which combine techniques like sub-space clustering and one-class support vector machines
to detect anomalies without prior knowledge, as presented by G. Pu et al. [9].
    Contemporary research underscores the importance of integrating various methodologies to improve
detection accuracy and efficiency. The survey by T. P. Raptis, A. Passarella, and M. Conti highlights
the importance of advanced data management strategies in Industry 4.0 environments, where the sheer
volume and variety of data necessitate robust anomaly detection techniques [10]. Similarly, D.
Samariya and A. Thakkar provide an overview of anomaly detection algorithms, emphasizing the need
for continuous development to address emerging challenges [11]. The research by G. Pang et al. further
explores deep learning methods for anomaly detection, emphasizing their potential to handle complex
and high-dimensional data [12].
    A significant gap in existing research is the lack of a flexible and universally applicable method for
anomaly detection and handling. Most current methods require prior knowledge of the data distribution
or involve complex parameter tuning, limiting their usability and effectiveness in real-world
applications where data characteristics can vary widely. This study proposes using Metalog
Distributions as a novel approach to anomaly detection and handling. Metalog Distributions offer a high
degree of flexibility, allowing them to model a wide range of data distributions without the need for
predefined distribution types. By utilizing quantile functions, Metalog Distributions can adapt to the
specific characteristics of the dataset, providing a more accurate and robust method for detecting
anomalies. This study explores the theoretical foundations of Metalog
Distributions, presents a methodology for their application in anomaly detection, and validates their
effectiveness through empirical examples.

2. Methods
    Metalog Distributions are defined through a specialized quantile function, which provides flexibility
to fit a wide range of distribution shapes. Unlike traditional distributions that require specific forms and
parameters, Metalog Distributions can accommodate various data distributions without predefined
assumptions. The quantile function $M_n(y; \mathbf{x}, \mathbf{y})$ for a Metalog Distribution is given by [13, 14]:

$$M_2(y; \mathbf{x}, \mathbf{y}) = a_1 + a_2 \ln\left(\frac{y}{1-y}\right) \quad \text{for } n = 2, \tag{1}$$

$$M_3(y; \mathbf{x}, \mathbf{y}) = a_1 + a_2 \ln\left(\frac{y}{1-y}\right) + a_3 (y - 0.5)\ln\left(\frac{y}{1-y}\right) \quad \text{for } n = 3, \tag{2}$$

$$M_4(y; \mathbf{x}, \mathbf{y}) = a_1 + a_2 \ln\left(\frac{y}{1-y}\right) + a_3 (y - 0.5)\ln\left(\frac{y}{1-y}\right) + a_4 (y - 0.5) \quad \text{for } n = 4, \tag{3}$$

$$M_n(y; \mathbf{x}, \mathbf{y}) = M_{n-1}(y; \mathbf{x}, \mathbf{y}) + a_n (y - 0.5)^{(n-1)/2} \quad \text{for odd } n \ge 5, \tag{4}$$

$$M_n(y; \mathbf{x}, \mathbf{y}) = M_{n-1}(y; \mathbf{x}, \mathbf{y}) + a_n (y - 0.5)^{n/2 - 1} \ln\left(\frac{y}{1-y}\right) \quad \text{for even } n \ge 6. \tag{5}$$



   where ๐‘ฆ is cumulative probability, 0 < ๐‘ฆ < 1.
   Given ๐ฑ = (๐‘ฅ1 , ๐‘ฅ2 , โ€ฆ , ๐‘ฅ๐‘š ) and ๐ฒ = (๐‘ฆ1 , ๐‘ฆ2 , โ€ฆ , ๐‘ฆ๐‘š ) of length ๐‘š โ‰ฅ ๐‘› consisting of the ๐‘ฅ and ๐‘ฆ
coordinates of cumulative distribution function (CDF) data, 0 < ๐‘ฆ๐‘– < 1 for each ๐‘ฆ๐‘– , and at least ๐‘› of
the ๐‘ฆ๐‘– โ€™s are distinct, the column vector of scaling constants ๐‘Ž = (๐‘Ž1 , ๐‘Ž2 , โ€ฆ , ๐‘Ž๐‘˜ ) is given by:
                                        ๐‘Ž = [๐˜๐‘›๐‘‡ ๐˜๐’ ]โˆ’1 ๐˜๐‘›๐‘‡ ๐ฑ,                                       (6)
             ๐‘‡
   where ๐˜๐‘› is the transpose of ๐˜๐’ , and the ๐‘š ร— ๐‘› matrix ๐˜๐’ is
                           ๐‘ฆ1
               1 ๐‘™๐‘› (           )
                        1 โˆ’ ๐‘ฆ1
       ๐˜๐Ÿ =             โ‹ฎ                                                           for ๐‘› = 2,      (7)
                           ๐‘ฆ๐‘š
               1 ๐‘™๐‘› (           )
             [          1 โˆ’ ๐‘ฆ๐‘š ]
                            ๐‘ฆ1                       ๐‘ฆ
                1 ๐‘™๐‘› (           ) (๐‘ฆ1 โˆ’ 0.5)๐‘™๐‘› ( 1 )
                          1 โˆ’ ๐‘ฆ1                  1 โˆ’ ๐‘ฆ1
       ๐˜๐Ÿ‘ =                          โ‹ฎ                                              for ๐‘› = 3,      (8)
                           ๐‘ฆ๐‘š                        ๐‘ฆ๐‘š
               1 ๐‘™๐‘› (           ) (๐‘ฆ๐‘š โˆ’ 0.5)๐‘™๐‘› (            )
             [          1 โˆ’ ๐‘ฆ๐‘š                    1 โˆ’ ๐‘ฆ๐‘š ]
                            ๐‘ฆ1                       ๐‘ฆ
                1 ๐‘™๐‘› (           ) (๐‘ฆ1 โˆ’ 0.5)๐‘™๐‘› ( 1 ) ๐‘ฆ1 โˆ’ 0.5
                          1 โˆ’ ๐‘ฆ1                  1 โˆ’ ๐‘ฆ1
       ๐˜๐Ÿ’ =                               โ‹ฎ                                         for ๐‘› = 4,      (9)
                           ๐‘ฆ๐‘š                        ๐‘ฆ๐‘š
               1 ๐‘™๐‘› (           ) (๐‘ฆ๐‘š โˆ’ 0.5)๐‘™๐‘› (            ) ๐‘ฆ๐‘š โˆ’ 0.5
             [          1 โˆ’ ๐‘ฆ๐‘š                    1 โˆ’ ๐‘ฆ๐‘š               ]
                   (๐‘ฆ1 โˆ’ 0.5)(๐‘›โˆ’1)/2
      ๐˜๐’ = [๐˜๐‘›โˆ’1 |         โ‹ฎ          ]                                            for odd ๐‘› โ‰ฅ 5,       (10)
                              (๐‘›โˆ’1)/2
                   (๐‘ฆ๐‘š โˆ’ 0.5)
                                ๐‘›            ๐‘ฆ1
                     (๐‘ฆ1 โˆ’ 0.5)2 โˆ’1 ๐‘™๐‘› (          )
                                           1 โˆ’ ๐‘ฆ1
      ๐˜๐’ = ๐˜๐‘›โˆ’1 ||                  โ‹ฎ                                              for even ๐‘› โ‰ฅ 6. (11)
                               ๐‘›             ๐‘ฆ๐‘š
                     (๐‘ฆ๐‘š โˆ’ 0.5)2 โˆ’1 ๐‘™๐‘› (         )
            [                              1 โˆ’ ๐‘ฆ๐‘š ]
    Metalog Distributions have a set of parameters, primarily the coefficients $a_1, a_2, \ldots, a_n$, which define
the shape of the distribution. These parameters can be interpreted as follows:
    - $a_1$ is the location parameter, shifting the distribution along the x-axis;
    - $a_2$ is the scale parameter, determining the spread of the distribution;
    - $a_3, a_4, \ldots, a_n$ are higher-order terms that add flexibility to the distribution, allowing it to capture
skewness, kurtosis, and other complex features of the data.
    These parameters are estimated using regression techniques on empirical quantiles, which allows
the Metalog Distribution to adapt closely to the observed data.
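The regression step can be sketched in Python with NumPy as a least-squares solve of equation (6); the function and variable names below are our own:

```python
import numpy as np

def metalog_basis(y, n_terms):
    """Build the m x n basis matrix Y_n of equations (7)-(11)."""
    y = np.asarray(y, dtype=float)
    logit = np.log(y / (1.0 - y))
    cols = [np.ones_like(y), logit]              # n = 2 columns
    if n_terms >= 3:
        cols.append((y - 0.5) * logit)
    if n_terms >= 4:
        cols.append(y - 0.5)
    for n in range(5, n_terms + 1):
        if n % 2 == 1:                           # odd n >= 5
            cols.append((y - 0.5) ** ((n - 1) // 2))
        else:                                    # even n >= 6
            cols.append((y - 0.5) ** (n // 2 - 1) * logit)
    return np.column_stack(cols)

def fit_metalog(x, y, n_terms=4):
    """Estimate a = [Y^T Y]^{-1} Y^T x (equation (6)). lstsq is used
    instead of forming the inverse explicitly, for numerical safety."""
    Y = metalog_basis(y, n_terms)
    a, *_ = np.linalg.lstsq(Y, np.asarray(x, dtype=float), rcond=None)
    return a
```

Solving via `np.linalg.lstsq` gives the same coefficients as equation (6) whenever $\mathbf{Y}_n$ has full column rank, while avoiding an explicit matrix inversion.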
    Metalog Distributions offer several advantages over traditional distributions, such as normal,
exponential, or gamma distributions:
    - Flexibility: Metalog Distributions can fit a wide variety of data shapes without needing predefined
forms. This is particularly useful for real-world data that do not conform to standard distributions;
    - Accuracy: By fitting the quantile function directly to the data, Metalog Distributions provide a
more accurate representation of the empirical distribution, especially in the tails;
    - Ease of Use: Metalog Distributions require fewer assumptions and can be easily fitted to data using
simple regression techniques.
    In contrast, traditional distributions often require specific assumptions about the data's underlying
structure, which may not hold in practical scenarios. For example, financial data can exhibit heavy tails
and skewness, medical data may be multimodal, and cybersecurity data might be highly irregular and
sparse. Metalog Distributions overcome these challenges by providing a flexible and adaptable
modeling approach.



3. Implementation in Anomaly Detection
    The application of Metalog Distributions in anomaly detection involves several key steps:
    1. Data Preprocessing: Preparing the dataset by handling missing values, normalizing features,
and splitting the data into training and testing sets.
    2. Fitting the Metalog Distribution: Using empirical quantiles from the training data to estimate
the parameters of the Metalog Distribution.
    3. Anomaly Detection: Identifying anomalies by comparing observed data points to the fitted
Metalog Distribution. Data points that deviate significantly from the expected distribution are flagged
as anomalies.
    4. Evaluation: Assessing the performance of the Metalog-based anomaly detection method using
metrics such as precision, recall, and F1 score, and comparing it with traditional anomaly detection
methods.
    Quantile analysis involves comparing the observed data points with the expected quantiles derived
from the fitted Metalog Distribution. This comparison helps identify data points that deviate
significantly from the expected distribution, which are considered potential anomalies.
    Step 1: Calculate Expected Quantiles. Use the fitted Metalog Distribution to calculate the expected
quantiles for each observed data point using the quantile function $M_n(y; \mathbf{x}, \mathbf{y})$ as described in formulas
(1)-(5).
    Step 2: Compute Deviations. For each observed data point $x_i$, compute the deviation from the
expected quantile $M_n(y_i; \mathbf{x}, \mathbf{y})$, where $y_i$ is the cumulative probability corresponding to $x_i$:

$$r_i = x_i - M_n(y_i; \mathbf{x}, \mathbf{y}) \tag{12}$$
    Step 3: Identify Anomalies. Data points with deviations exceeding a predefined threshold are flagged
as anomalies. The threshold can be determined based on the statistical properties of the deviations, such
as using a multiple of the standard deviation or interquartile range.
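The three steps above can be sketched as follows, using the interquartile-range option mentioned in Step 3 (an illustrative implementation; the function name and the choice of multiplier are ours):

```python
import numpy as np

def flag_by_deviation(x, expected, k=1.5):
    """Flag observations whose deviation r_i = x_i - M_n(y_i; x, y)
    (formula (12)) falls outside k * IQR of the deviations.
    `expected` holds the metalog-predicted value for each point;
    k = 1.5 is the conventional IQR multiplier."""
    r = np.asarray(x, dtype=float) - np.asarray(expected, dtype=float)
    q1, q3 = np.percentile(r, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (r < lower) | (r > upper)             # boolean anomaly mask
```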
    Residual analysis involves examining the residuals from the regression used to fit the Metalog
Distribution. Residuals represent the difference between the observed data points and the values
predicted by the quantile function.
    Step 1: Calculate Residuals. For each observed data point $x_i$, calculate the residual $r_i$ as the difference
between the observed value and the value predicted by the Metalog quantile function $M_n(y_i; \mathbf{x}, \mathbf{y})$, as in
(12).
    Step 2: Analyze Residuals. Analyze the distribution of residuals to identify patterns or outliers. Large
residuals indicate data points that are not well-explained by the fitted distribution and may represent
anomalies.
    Step 3: Identify Anomalies. Flag data points with residuals exceeding a certain threshold as
anomalies. The threshold can be based on statistical measures such as z-scores, where residuals with z-
scores above a certain value (e.g., 3) are considered anomalous.
    Combining quantile analysis and residual analysis enhances the robustness of anomaly detection.
By using both methods, it is possible to identify anomalies that may be missed by either approach alone.
This combined approach ensures a comprehensive analysis of the data, capturing both large deviations
from expected quantiles and significant residuals.

3.1.    Data Preparation

   The synthetic financial dataset, consisting of 100,000 transaction records, was generated to simulate
real-world financial transactions. The dataset includes the following features:
   - Transaction ID: Unique identifier for each transaction;
   - Timestamp: Date and time of the transaction;
   - Amount: Amount of money transferred in the transaction;
   - Transaction Type: Type of transaction (e.g., purchase, withdrawal, transfer);
   - Is Fraud: Binary indicator of whether the transaction is fraudulent.
   Each feature was preprocessed as follows:




    - Amount: Generated using a log-normal distribution to better simulate real-world transaction
amounts. This approach accounts for the skewed nature of financial transactions, with many small
transactions and fewer large ones. The amounts were then normalized using min-max scaling to bring
all values within the range [0,1];
    - Transaction Type: Generated with different probabilities for each type (purchase: 70%,
withdrawal: 20%, transfer: 10%) to reflect typical transaction patterns. The categorical values were one-
hot encoded to convert them into a numerical format;
    - Timestamp: Converted to numerical format representing the number of seconds since the start of
the data collection period, with added randomness to simulate varying transaction times.
    The dataset was split into training and testing sets, with 80% of the data used for training and 20%
for testing. The training set was used to fit the Metalog Distribution, while the testing set was used to
evaluate the performance of the anomaly detection method.
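The preparation steps above can be reproduced in outline as follows. This is a hypothetical sketch: the log-normal parameters, the 60-second transaction spacing, and all variable names are our assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 100_000

# Amounts: log-normal, many small and few large transactions
# (mean/sigma of the underlying normal are assumed values)
amount = rng.lognormal(mean=4.0, sigma=1.0, size=n)
# Min-max scaling to [0, 1]
amount_scaled = (amount - amount.min()) / (amount.max() - amount.min())

# Transaction types with the stated probabilities, then one-hot encoded
categories = np.array(["purchase", "withdrawal", "transfer"])
tx_type = rng.choice(categories, size=n, p=[0.7, 0.2, 0.1])
one_hot = (tx_type[:, None] == categories[None, :]).astype(int)

# Timestamps: seconds since the start of collection, with random jitter
timestamp = np.arange(n) * 60 + rng.integers(0, 60, size=n)

# Roughly 1% of transactions labeled as fraudulent
is_fraud = (rng.random(n) < 0.01).astype(int)

# 80/20 train-test split
idx = rng.permutation(n)
split = int(0.8 * n)
train_idx, test_idx = idx[:split], idx[split:]
```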
    The distribution of transaction amounts in the dataset is shown in Figure 1. The histogram reveals
that the transaction amounts follow a log-normal distribution, which better represents the real-world
variation in transaction amounts, capturing both small and large transactions.




   Figure 1: Distribution of Transaction Amounts
   The count of transactions for each transaction type is illustrated in Figure 2. This bar plot indicates
that the dataset includes a realistic distribution of different transaction types, with purchases being the
most common, followed by withdrawals and transfers. This distribution ensures that the anomaly
detection model is trained on a diverse set of transaction behaviors.
   To prepare the data for fitting the Metalog Distribution, the transaction amounts were normalized to
a range of [0, 1]. This normalization process is depicted in Figure 3, which shows the distribution of
the normalized transaction amounts. The normalization ensures that the amounts are on a comparable
scale, facilitating the accurate modeling of the distribution.
   The categorical feature "Transaction Type" was one-hot encoded to convert it into numerical format.
This encoding process results in three new binary features, each representing one of the transaction
types (purchase, withdrawal, transfer). The first few rows of the encoded dataset are displayed in
Table 1, showing the additional binary columns for each transaction type.
   The "Timestamp" feature was converted to a numerical format representing the number of seconds
since the start of the data collection period. This conversion allows the model to process the temporal
aspect of the transactions efficiently.
   The dataset includes a binary indicator for fraud, with approximately 1% of the transactions labeled
as fraudulent. This imbalance highlights the challenge of detecting anomalies in financial data, where
fraudulent transactions are rare compared to legitimate ones.


   Figure 2: Number of Transactions per Transaction Type




   Figure 3: Normalized Distribution of Transaction Amounts
Table 1
First few rows of the encoded dataset
  Transaction ID | Timestamp  | Amount | Transaction Type | Is Fraud | purchase | withdrawal | transfer
  1              | 1672531200 | 150.75 | purchase         | 0        | 1        | 0          | 0
  2              | 1672531260 | 78.50  | transfer         | 1        | 0        | 0          | 1
  3              | 1672531320 | 110.25 | withdrawal       | 0        | 0        | 1          | 0
  …              | …          | …      | …                | …        | …        | …          | …

3.2.    Metalog Distribution Fitting

   The fitting process begins with calculating the empirical quantiles from the normalized transaction
amounts in the training dataset. Empirical quantiles represent the CDF of the data and serve as the basis
for estimating the parameters of the Metalog Distribution. Figure 4 visualizes the distribution of


normalized transaction amounts and confirms their uniform distribution across corresponding quantiles,
which is a crucial step before using them to estimate the parameters of the Metalog Distribution.
    Using regression techniques, the parameters of the Metalog Distribution are estimated from the
empirical quantiles. The quantile function $M_n(y; \mathbf{x}, \mathbf{y})$ for a Metalog Distribution with $n$ terms is
utilized to fit the data. This function accommodates various distribution shapes by adjusting parameters
such as location, scale, skewness, and higher-order terms.
    The estimated parameters of the Metalog Distribution are as follows:
    $a_1 = 0.02866$, $a_2 = 0.02590$, $a_3 = 0.02738$, $a_4 = -0.04471$.




   Figure 4: Visualization of Empirical Quantiles of Normalized Transaction Amounts
   These parameters define the shape of the Metalog Distribution, which will be used for anomaly
detection in subsequent steps.

3.3.    Anomaly Detection

    Anomalies were detected by first calculating the residuals between observed transaction amounts
and their corresponding expected values based on the fitted Metalog Distribution. This involves
computing the deviation $r_i$ for each transaction $x_i$, as given by formula (12). Anomalies are identified
based on the magnitude of these residuals: a threshold $\tau$ is set such that if $|r_i| > \tau$, the transaction $x_i$ is
flagged as an anomaly. In this study, the anomaly threshold was defined using the Median Absolute
Deviation (MAD), which is more robust to outliers than standard deviation-based methods. The MAD
is calculated as follows:

$$MAD = \operatorname{median}(|r_i - \operatorname{median}(r_i)|).$$

    The threshold is then set to $\tau = 3 \cdot MAD$, where the scaling factor of 3 is a common choice for
identifying significant deviations in anomaly detection. Transactions with residuals exceeding $\tau$ in
absolute value are flagged as anomalies. This makes the threshold adaptive to the data's variability and
not unduly influenced by extreme values, which suits skewed distributions such as the log-normal.
Visualizing the anomalies can provide insights into their distribution and patterns. Figure 5
illustrates the implementation of calculating residuals for anomaly detection, highlighting the flagged
anomalies.
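A minimal sketch of this MAD-based rule (the function name is ours):

```python
import numpy as np

def mad_flags(residuals, k=3.0):
    """Flag residuals with |r_i| > tau, where tau = k * MAD and
    MAD = median(|r_i - median(r_i)|), as described above."""
    r = np.asarray(residuals, dtype=float)
    mad = np.median(np.abs(r - np.median(r)))
    tau = k * mad
    return np.abs(r) > tau
```

Because both the center and the spread are estimated with medians, a handful of extreme residuals barely moves the threshold, unlike a mean-and-standard-deviation rule.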




   Figure 5: Example implementation of calculating residuals for anomaly detection

3.4.    Evaluation Metrics
   In this study, several standard evaluation metrics were utilized to assess the performance of the
Metalog Distribution-based anomaly detection method. These metrics include precision, recall, F1
score, and area under the receiver operating characteristic (ROC) curve (AUC).
   1. Precision, also known as positive predictive value, measures the proportion of true anomalies
among the detected anomalies. It is defined as:

$$Precision = \frac{TP}{TP + FP},$$

where $TP$ denotes true positives (correctly identified anomalies) and $FP$ denotes false positives (normal
instances incorrectly identified as anomalies). High precision indicates that the model has a low false
positive rate. In our study, the precision achieved was 0.85, suggesting that 85% of the detected
anomalies were true anomalies.
   2. Recall, or sensitivity, measures the proportion of actual anomalies that are correctly identified by
the model. It is defined as:

$$Recall = \frac{TP}{TP + FN},$$

where $FN$ denotes false negatives (actual anomalies that the model did not identify). High recall
indicates that the model has a low false negative rate. The recall obtained in our evaluation was 0.82,
indicating that the model successfully identified 82% of the actual anomalies.
   3. The F1 score is the harmonic mean of precision and recall, providing a single metric that balances
both. It is particularly useful when there is an uneven class distribution (i.e., anomalies are much rarer
than normal instances). The F1 score is defined as:

$$F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}.$$

A high F1 score indicates a good balance between precision and recall. In our results, the F1 score was
0.83, reflecting a balanced performance between precision and recall.
   4. The ROC curve is a graphical representation of the true positive rate (recall) against the false
positive rate (1 - specificity) at various threshold settings. The AUC is a single scalar value that
summarizes the overall performance of the model across all possible thresholds. An AUC value of 1
indicates perfect performance, while an AUC value of 0.5 indicates performance no better than random
chance. Our model achieved an AUC of 0.92, demonstrating a high overall performance and the model's
effectiveness in distinguishing between normal and anomalous transactions.
   These evaluation metrics provide a comprehensive understanding of the model's performance in
detecting anomalies. Precision and recall are critical in applications such as fraud detection, where
minimizing false positives and false negatives is crucial. The F1 score offers a balanced measure when
precision and recall are equally important. The AUC value provides an overall performance assessment
independent of the specific threshold chosen for anomaly detection.
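The definitions above can be computed directly from the confusion counts. The following plain-Python sketch is illustrative; in practice, library functions such as scikit-learn's `precision_recall_fscore_support` and `roc_auc_score` (the latter requiring anomaly scores rather than hard flags) provide equivalent results.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 from binary ground-truth
    labels and predicted anomaly flags, per the definitions above."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```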


   By employing these metrics, the robustness and accuracy of the Metalog Distribution-based anomaly
detection method were effectively evaluated, ensuring its suitability for real-world financial data
analysis and other applications where accurate anomaly detection is essential. The results demonstrated
that the proposed method performed well, with high precision, recall, and AUC values, indicating its
effectiveness in detecting anomalies in financial transaction data.

4. Discussion
    The results of this study demonstrate the potential of using Metalog Distributions for anomaly
detection in financial transaction data. By leveraging the flexibility of Metalog Distributions, which can
model a wide range of distribution shapes without predefined assumptions, anomalies in a synthetic
dataset of financial transactions were accurately detected. The approach achieved high performance
metrics, with a precision of 0.85, recall of 0.82, F1 score of 0.83, and AUC of 0.92. These results
indicate that the Metalog Distribution-based method is effective in distinguishing between normal and
anomalous transactions, minimizing both false positives and false negatives. The high precision value
suggests that most of the detected anomalies were indeed true anomalies, which is crucial in applications
like fraud detection where the cost of false positives can be significant. Similarly, the high recall value
demonstrates the method's ability to identify a substantial proportion of actual anomalies, ensuring that
few fraudulent activities go unnoticed.
    The primary advantage of Metalog Distributions lies in their flexibility and adaptability to different
data distributions. Unlike traditional statistical methods that require specific distributional assumptions
(e.g., normality), Metalog Distributions can fit data with heavy tails, skewness, and other irregular
characteristics commonly found in real-world financial data. This flexibility reduces the need for
extensive parameter tuning and prior knowledge about the data distribution, making Metalog
Distributions particularly useful in dynamic and diverse real-world applications.
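To make the fitting step concrete, the sketch below estimates the k-term metalog of Keelin [13] by ordinary least squares on the empirical quantiles, a standard approach to metalog estimation. The function names and the choice k = 4 are illustrative assumptions, not the exact implementation used in this study:

```python
import numpy as np

def metalog_basis(y, k=4):
    """Basis terms of the k-term metalog quantile function (Keelin, 2016)."""
    L = np.log(y / (1.0 - y))                # logit of the probability
    cols = [np.ones_like(y), L, (y - 0.5) * L, (y - 0.5)]
    # Higher-order terms alternate powers of (y - 0.5), without and with L.
    for j in range(5, k + 1):
        p = (j - 1) // 2
        cols.append((y - 0.5) ** p if j % 2 == 1 else ((y - 0.5) ** p) * L)
    return np.column_stack(cols[:k])

def fit_metalog(data, k=4):
    """Fit metalog coefficients by OLS on the empirical quantiles."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    y = (np.arange(1, n + 1) - 0.5) / n      # plotting positions in (0, 1)
    a, *_ = np.linalg.lstsq(metalog_basis(y, k), x, rcond=None)
    return a

def metalog_quantile(a, y):
    """Evaluate the fitted quantile function at probabilities y."""
    return metalog_basis(np.asarray(y, dtype=float), k=len(a)) @ a

# Example: fit a 4-term metalog to a synthetic sample.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=2000)
coeffs = fit_metalog(sample, k=4)
q = metalog_quantile(coeffs, np.linspace(0.05, 0.95, 19))
```

One caveat worth checking in practice: not every coefficient vector yields a valid (monotonically increasing) quantile function, so a feasibility check on the fitted quantiles is advisable before using the fit for anomaly scoring.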
    Despite the promising results, this study has several limitations that warrant further investigation.
First, the synthetic dataset used here may not fully capture the complexities and nuances of real-world
financial data; future research should validate the proposed method on real transaction datasets from
different financial institutions to confirm its robustness and generalizability. Second, the current
implementation focuses primarily on numerical data, and its application to datasets containing
categorical variables remains a challenge [16]. Categorical data appear frequently in financial
transactions (transaction types, customer segments, etc.) and require specialized techniques for
encoding and integration into the Metalog framework that this study does not fully address; future work
should explore how to incorporate categorical variables into the Metalog-based anomaly detection
approach. Third, although Metalog Distributions offer significant flexibility, fitting them and
computing the corresponding quantiles can be computationally intensive, particularly for large
datasets [17], so optimization techniques that improve the efficiency of the fitting process deserve
attention. Finally, the anomaly detection threshold, set in this study from the statistical properties
of the residuals, could be refined further: adaptive thresholding methods that dynamically adjust the
threshold to the data's characteristics and context could enhance the accuracy and robustness of the
anomaly detection process.
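As a sketch of the thresholding ideas above, a static rule derived from the statistical properties of the residuals, and a simple rolling-window adaptive variant, might look as follows. The window size and multiplier k are illustrative assumptions, not values taken from this study:

```python
import numpy as np

def residual_threshold(residuals, k=3.0):
    """Static rule: cut-off at mean + k * std of the absolute residuals."""
    r = np.abs(np.asarray(residuals, dtype=float))
    return r.mean() + k * r.std()

def adaptive_threshold(residuals, window=50, k=3.0):
    """Rolling variant: the cut-off is recomputed over a sliding window,
    so it tracks local data behaviour; the first `window` points are skipped."""
    r = np.abs(np.asarray(residuals, dtype=float))
    flags = np.zeros(len(r), dtype=bool)
    for i in range(window, len(r)):
        local = r[i - window:i]
        flags[i] = r[i] > local.mean() + k * local.std()
    return flags

# Example: one injected anomaly among otherwise ordinary residuals.
rng = np.random.default_rng(1)
residuals = rng.normal(size=200)
residuals[150] = 12.0                       # injected anomaly
flags = adaptive_threshold(residuals, window=50, k=3.0)
```

A known weakness of the static rule is masking: large anomalies inflate the very mean and standard deviation used to set the cut-off, which is one more argument for the adaptive, locally estimated variant.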

5. Conclusion

   The use of Metalog Distributions for anomaly detection offers a novel and flexible
approach that addresses several limitations of traditional methods. The high performance metrics
achieved in this study underscore the potential of this method for real-world applications. The
adaptability of Metalog Distributions allows for accurate modeling of various distribution shapes
without the need for predefined assumptions, making it a versatile tool in anomaly detection. Its ability
to fit complex data patterns enhances its effectiveness across different domains, including
cybersecurity, healthcare, and manufacturing.
   Moreover, the success of Metalog Distributions in this study paves the way for integrating this
method with advanced machine learning techniques. Such integration could lead to the development of
sophisticated hybrid systems that leverage both statistical and machine learning approaches for
enhanced anomaly detection. Future research should focus on exploring these synergies and applying
Metalog Distribution-based methods to more complex and large-scale datasets. By doing so, the
potential benefits of this flexible statistical tool can be fully realized, leading to more accurate and
efficient detection of anomalies in a wide range of applications.

6. References
[1] D. D. Pandya and S. Gaur, "Detection of Anomalous Value in Data Mining," Kalpa Publications in
     Engineering, Oct. 2018, doi: https://doi.org/10.29007/6xfn.
[2] Y. Djenouri, A. Belhadi, J. C.-W. Lin, D. Djenouri, and A. Cano, "A Survey on Urban Traffic
     Anomalies Detection Algorithms," IEEE Access, vol. 7, pp. 12192–12205, 2019, doi:
     https://doi.org/10.1109/access.2019.2893124.
[3] H. Wang, M. J. Bah, and M. Hammad, "Progress in Outlier Detection Techniques: A Survey,"
     IEEE Access, vol. 7, pp. 107964–108000, 2019, doi: https://doi.org/10.1109/ACCESS.2019.2932769.
[4] A. L. Buczak and E. Guven, "A Survey of Data Mining and Machine Learning Methods for Cyber
     Security Intrusion Detection," IEEE Communications Surveys & Tutorials, vol. 18, no. 2, pp.
     1153–1176, 2016, doi: https://doi.org/10.1109/comst.2015.2494502.
[5] D. Palko, T. Babenko, A. Bigdan, N. Kiktev, T. Hutsol, M. Kuboń, H. Hnatiienko, S. Tabor,
     O. Gorbovy, and A. Borusiewicz, "Cyber Security Risk Modeling in Distributed Information
     Systems," Applied Sciences, vol. 13, no. 4, 2393, 2023, doi: https://doi.org/10.3390/app13042393.
[6] L. Ruff et al., "A Unifying Review of Deep and Shallow Anomaly Detection," Proceedings of the
     IEEE, 2021, doi: https://doi.org/10.1109/JPROC.2021.3052449.
[7] A. B. Nassif, M. A. Talib, Q. Nasir, and F. M. Dakalbab, "Machine Learning for Anomaly
     Detection: A Systematic Review," IEEE Access, vol. 9, pp. 78658–78700, 2021, doi:
     https://doi.org/10.1109/access.2021.3083060.
[8] A. Kutyrev, N. Kiktev, O. Kalivoshko, and R. Rakhmedov, "Recognition and Classification Apple
     Fruits Based on a Convolutional Neural Network Model," CEUR Workshop Proceedings, vol. 3347,
     pp. 90–101, 2022. URL: https://ceur-ws.org/Vol-3347/Paper_8.pdf.
[9] G. Pu, L. Wang, J. Shen, and F. Dong, "A hybrid unsupervised clustering-based anomaly detection
     method," Tsinghua Science and Technology, vol. 26, no. 2, pp. 146–153, Apr. 2021, doi:
     https://doi.org/10.26599/tst.2019.9010051.
[10] T. P. Raptis, A. Passarella, and M. Conti, "Data Management in Industry 4.0: State of the Art and
     Open Challenges," IEEE Access, vol. 7, pp. 97052–97093, 2019, doi:
     https://doi.org/10.1109/access.2019.2929296.
[11] D. Samariya and A. Thakkar, "A Comprehensive Survey of Anomaly Detection Algorithms," Annals
     of Data Science, Nov. 2021, doi: https://doi.org/10.1007/s40745-021-00362-9.
[12] G. Pang, C. Shen, L. Cao, and A. V. D. Hengel, "Deep Learning for Anomaly Detection," ACM
     Computing Surveys, vol. 54, no. 2, pp. 1–38, Mar. 2021, doi: https://doi.org/10.1145/3439950.
[13] T. W. Keelin, "The Metalog Distributions," Decision Analysis, vol. 13, no. 4, pp. 243–277, Dec.
     2016, doi: https://doi.org/10.1287/deca.2016.0338.
[14] S. Nestler and T. Keelin, "Introducing the Metalog Distributions," Significance, vol. 19, no. 6, pp.
     31–33, Nov. 2022, doi: https://doi.org/10.1111/1740-9713.01705.
[15] I. J. Faber, "Cyber risk management: AI-generated warnings of threats," purl.stanford.edu, 2019.
     Accessed: Jul. 03, 2024. [Online]. Available: https://purl.stanford.edu/mw190gm2975.
[16] O. Tymchuk, A. Pylypenko, and M. Iepik, "Forecasting of Categorical Time Series Using
     Computing with Words Model," in Selected Papers of the IX International Scientific Conference
     "Information Technology and Implementation" (IT&I-2022), Workshop Proceedings, Kyiv,
     Ukraine, November 30 – December 02, 2022, vol. 3384, pp. 151–159. URL: https://ceur-
     ws.org/Vol-3384/Short_2.pdf.
[17] "The Metalog Distributions," metalogdistributions.com.
     http://metalogdistributions.com/softwareimplementations.html



