<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>November</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Machine Learning Techniques for Anomaly Detection Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Iulia Khlevna</string-name>
          <email>yuliya.khlevna@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bohdan Koval</string-name>
          <email>bohkoval@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>Volodymyrska str., 60, Kyiv, 01033</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>2</volume>
      <fpage>0</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>Efficient processing of extensive datasets is crucial in data-driven applications, particularly for anomaly detection. This article explores the application of parallel and distributed machine learning techniques to enhance anomaly detection systems. Parallel training, including model and data parallelism, significantly reduces processing times in data preprocessing, feature engineering, and model training. Experiments demonstrate notable efficiency improvements, especially for extensive datasets. Distributed training is crucial in scenarios with time-intensive training, storage constraints, data localization, or RAM limitations. The approach to parallelize learning process has been demonstrated, which allowed to speed up the processing time by 2x via allocating 4x resources. Also, it depicts how the process can be spread across multiple machines using distributed techniques. In conclusion, this article emphasizes the pivotal role of parallel and distributed machine learning techniques in accelerating the development of anomaly detection systems. These techniques empower organizations to address challenges posed by extensive datasets and real-time data streams effectively. As data-driven applications evolve, these advanced computing methods promise more effective and timely responses to anomalies, solidifying their significance in the era of big data. machine learning, distributed computing, parallel processing, big data, scalable systems Proceedings</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In traditional machine learning approaches, data processing typically revolves around a single
computational process. For instance, in widely-used machine learning libraries like scikit-learn, the
default setting often employs a single processing core. While this method works effectively for
relatively small datasets, typically measured in kilobytes or megabytes, it becomes inefficient and
impractical when dealing with substantial volumes of data, particularly in the gigabytes or even
terabytes range. Notably, the primary issue here is the prolonged processing times, especially during
the training phase.</p>
      <p>In modern data-driven applications, the need to analyze and extract insights from large datasets has
become increasingly common. Anomaly detection systems, which are essential for identifying subtle
irregularities or deviations, require the capability to process vast amounts of data.</p>
      <p>An urgent scientific task based on the above rises, which consists of the development of the concept
to tackle the significant computational challenges posed by such large datasets. This has given rise to
the concept of parallel and distributed machine learning, which leverages the power of multiple
processing units or distributed computing resources to speed up the analysis of massive datasets and
enhance the overall efficiency of anomaly detection systems.</p>
      <p>In this article, we explore the world of parallel and distributed machine learning and its application
in anomaly detection systems. We will examine the theoretical foundations, practical implementations,</p>
      <p>2023 Copyright for this paper by its authors.
CEUR</p>
      <p>ceur-ws.org
and the numerous advantages it brings to the complex task of detecting anomalies in extensive datasets.
By adopting these advanced computational techniques, our goal is to alleviate the time and resource
constraints that have traditionally hindered real-time anomaly detection in large-scale data
environments.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature overview</title>
      <p>
        Ben-Nun and Hoefler, in "Demystifying Parallel and Distributed Deep Learning: An In-Depth
Concurrency Analysis" [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], offer a profound exploration into the intricacies of parallel and distributed
deep learning. Their work dissects the critical issue of achieving concurrency in distributed settings, a
fundamental aspect relevant to anomaly detection systems.
      </p>
      <p>
        Aach, Inanc, Sarma, Riedel, and Lintermann, in "Large scale performance analysis of distributed
deep learning frameworks for convolutional neural networks" [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], delve into the performance analysis
of distributed deep learning frameworks. With a focus on convolutional neural networks, this study
investigates the scalability and efficiency of distributed deep learning algorithms - a pivotal area of
study for implementing anomaly detection systems at scale.
      </p>
      <p>
        In the context of real-time data streams, Esmaeilzadeh, Salajegheh, Ziai, and Boote, in "Abuse and
Fraud Detection in Streaming Services Using Heuristic-Aware Machine Learning" [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], provide
practical insights into the application of heuristic-aware machine learning techniques for abuse and
fraud detection. Their work demonstrates the importance of heuristic-aware approaches in identifying
anomalies in dynamic data sources, a crucial component of anomaly detection systems.
      </p>
      <p>
        "Parallel and Distributed Training of Deep Neural Networks: A brief overview" by Attila Farkas,
Gábor Kertész, Róbert Lovas [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] focuses on the parallel and distributed training of deep neural
networks, a fundamental topic in modern machine learning. The authors likely discuss techniques and
methods for training deep neural networks across multiple processors or nodes to expedite the training
process and handle large datasets efficiently. Parallelization and distribution are crucial for making deep
learning models practical for real-world applications.
      </p>
      <p>
        In "Toward Parallel and Distributed Learning by Meta-Learning" by Philip K. Chan and Salvatore
J. Stolfo [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], the authors explore the concept of meta-learning in the context of parallel and distributed
machine learning. Meta-learning involves developing models that can adapt to new tasks quickly, and
the paper likely discusses how this concept can be used to enhance the scalability and efficiency of
machine learning in parallel and distributed systems. It may propose methods and frameworks for
leveraging meta-learning to improve anomaly detection.
      </p>
      <p>
        The work authored Ron Bekkerman, Mikhail Bilenko, John Langford named "Scaling up Machine
Learning: Parallel and Distributed Approaches" [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] deals with the scaling up of machine learning using
parallel and distributed approaches. The authors probably discuss a wide range of techniques and
strategies for handling large datasets and complex models in parallel and distributed environments.
Such methods are essential for anomaly detection systems, which often deal with massive amounts of
data and require efficient algorithms to identify anomalies.
      </p>
      <p>
        The research "Boosting Algorithms for Parallel and Distributed Learning" by Aleksandar Lazarevic
and Zoran Obradovic [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] delves into the application of boosting algorithms in the context of parallel
and distributed learning. Boosting is a machine learning ensemble technique that combines multiple
weak learners to create a strong learner. The paper probably explores how boosting algorithms can be
adapted and optimized for parallel and distributed systems to improve anomaly detection performance.
      </p>
      <p>Collectively, these works contribute to our comprehensive understanding of parallel and distributed
machine learning. They address various facets of the topic, ranging from the optimization of
concurrency to patterns, performance analysis, and real-world applications in distributed environments.
As the foundation for our investigation into parallel and distributed machine learning for anomaly
detection systems, these studies inform the development of scalable and efficient solutions to the
challenges posed by large-scale datasets and real-time data streams.</p>
      <p>The examination of existing literature has revealed the critical importance of research endeavors that
focus on accelerating the training process of machine learning models, particularly in the context of
today's era characterized by the prevalence of big data and data-intensive applications. This underscores
the necessity of exploring techniques and strategies aimed at expediting the learning phase, as it directly
addresses the challenges posed by the vast volumes of data that modern applications deal with.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Purpose and objectives of the research</title>
      <p>The purpose of this research is to explore how parallel and distributed machine learning techniques
can be harnessed to make anomaly detection systems operate with greater speed and efficiency. We
seek to comprehend how these techniques can be practically integrated into various critical domains
such as fraud detection, cybersecurity, and healthcare, where identifying anomalies is of paramount
importance. Our primary objectives in this endeavor encompass a comprehensive understanding of
parallel machine learning, which involves studying the methods through which computational tasks are
divided and assigned to multiple processors or nodes. This distributed approach holds the promise of
expediting the training process and enabling the scalability of anomaly detection systems.</p>
      <p>Furthermore, we aim to delve into the intricacies of distributed machine learning, which extends
beyond the mere distribution of computational workloads. It entails the dispersion of data and
computation across multiple machines or clusters, ultimately contributing to more efficient learning
processes, scalability, and adaptability. A crucial aspect of our research is the quantitative assessment
of the actual speed improvement achieved through the application of parallel and distributed machine
learning techniques. This necessitates rigorous benchmarking and performance evaluation to precisely
measure the efficiency gains and the reduction in processing time.</p>
      <p>To achieve this goal, the following tasks must be solved:
1. Differentiate parallel and distributed machine learning.
2. Formulate parallel machine learning technique and analyze its performance
3. Elaborate on distributed machine learning technique (where the process is spread not across
multiple cores, but physical machines over network) and how we can apply it.
can make use of this feature to distribute the computation across multiple CPU cores, which
can drastically reduce processing time.
●</p>
      <p>Distributed Computing: If your dataset is extremely large and cannot fit in memory, consider
using distributed computing frameworks like Apache Spark. These frameworks can distribute
the data and computation across a cluster of machines.</p>
      <p>Below it the comparison of these 2 approaches (Table 1):</p>
      <p>In the realm of machine learning, parallel and distributed computing techniques have garnered
increasing attention due to their significant impact on the efficiency and scalability of anomaly detection
systems.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Parallel machine learning algorihms 4.1.</title>
    </sec>
    <sec id="sec-5">
      <title>Understanding parallel training</title>
      <p>
        Parallelism, at its core, is a strategic framework aimed at overcoming the limitations posed by the
size of large machine learning models and enhancing the overall training efficiency. Within this
framework, there are two primary types of parallelism, each tailored to distinct objectives. The first is
model parallelism, which focuses on dividing a large model into smaller, manageable components that
can be trained concurrently. The second is data parallelism, a technique where subsets of the dataset are
distributed to multiple processing units for simultaneous training [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>When you parallelize a task, the idea is that if you double the number of processing elements, it
should take only half the time, and if you double it again, the time should halve once more. However,
in reality, only a few parallel algorithms can achieve this perfect speedup. Most of them show nearly
linear speedup when you have a small number of processing elements, but as you add more, the speedup
tends to level off and remains relatively constant.</p>
      <p>The possible increase in the speed of an algorithm when it runs on a parallel computing platform is
determined by Amdahl's law:
 
( ) =</p>
      <p>
        =
1 −  +  /
 +  (1 −  )
where Slatency represents the potential reduction in the time it takes to complete the entire task, s the
reduction in time specifically for the part of the task that can be done in parallel, p is the portion of the
total task time that is spent on the part that can be parallelized before parallelization [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>Since Slatency &lt; 1/(1 - p), it indicates that a small portion of the program that cannot be parallelized
will restrict the maximum speedup achievable through parallelization. A program that tackles a
substantial mathematical or engineering problem usually includes both parts that can be parallelized
and parts that must be executed sequentially. If the non-parallelizable segment of a program makes up
10% of the total runtime (with p = 0.9), the speedup cannot exceed 10 times, regardless of how many
processors are added. This sets a limit on the benefit of adding more parallel execution units. In simple
terms, if a task cannot be divided due to sequential constraints, putting in more effort won't accelerate
the schedule.</p>
      <p>
        The picture (Fig. 1) provides an insight into the strategies employed for data and weight partitioning
within the context of Google Switch Transfers. The Switch Transformer represents an innovative
adaptation of the traditional transformer architecture by introducing the concept of a switch
feedforward neural network (FFN) layer. The fundamental departure from convention lies in the fact that,
unlike the typical transformer FFN layer, the switch layer incorporates multiple FFNs, which are
referred to as "experts." In the operation of this layer, each individual token undergoes a distinct process.
It commences its journey by traversing a router function, responsible for directing the token to a specific
FFN expert. Therefore, this process can be parallelized [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In this representation, each 4×4 grid demarcated by dotted lines symbolizes a collective of 16 cores.
The shaded squares within these grids represent the data residing on each core, which can include either
model weights or a batch of tokens. This illustration serves to elucidate how model weights and data
tensors are partitioned for each strategy [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>The first row in this visual representation offers a glimpse into the division of model weights among
the cores. Different-sized shapes in this row signify the presence of larger weight matrices, notably in
the Feed Forward Network (FFN) layers, denoted by more extensive "dff" sizes. Each color attributed
to the shaded squares designates a distinct weight matrix. It's noteworthy that the number of parameters
allocated to each core remains constant, although larger weight matrices entail a more substantial
computational workload for each token.</p>
      <p>
        Moving on to the second row, it unveils the manner in which a data batch is distributed across the
cores. In this case, each core accommodates an identical number of tokens, ensuring a uniform memory
utilization across all the partitioning strategies. These strategies exhibit varying properties, allowing for
each core to harbor tokens in the same or different colors, resulting in distinctive arrangements across
the cores [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>This depiction serves as a visual guide to comprehend the intricacies of data and weight partitioning
strategies in Google Switch Transfers. It showcases the flexibility and versatility in distributing model
weights and data, crucial for optimizing the efficiency and performance of deep learning models.
4.2.</p>
    </sec>
    <sec id="sec-6">
      <title>Applying parallel training</title>
      <p>Let’s apply parallel machine learning and compare to sequential processing. Speed is of the essence
when developing and fine-tuning machine learning models, especially in constrained environments like
kernels competitions. Faster iterations over various ideas can lead to better-performing models. In the
realm of anomaly detection, the need for rapid data processing and model training is particularly critical.
This is where parallel and distributed computing comes to the rescue.</p>
      <p>In the forthcoming example, we will undertake a comparison between the sequential execution of a
model and its parallel counterpart utilizing four threads. This comparison will be conducted within the
Python environment, with particular emphasis on its C implementation.</p>
      <p>Parallel processing can significantly enhance the speed of data pipelines, and in Python, this can be
easily achieved by dividing the main input data into parts and concurrently applying functions to these
segments. Two powerful libraries, “multiprocessing” and “joblib”, facilitate this parallelism. Let's take
a closer look at how parallel data processing can be implemented in a kernel environment.</p>
      <p>In the kernel environment, where efficiency is paramount, a timer function is employed to measure
the execution time of specific operations. This allows us to precisely gauge the impact of parallel
processing on our data processing pipelines. The split_df function is used to divide the main DataFrame
(df_)intomultiplesmaller DataFrames, equalinsize. This makes it easier toapplyparallelfunctions to
these mini-DataFrames.</p>
      <p>def timer(name):
t0 = time.time()
yield
print('{0} done in {1:.3f} s.'.format(name, time.time() - t0))
def split_df(df, num_splits):
# Split the DataFrame into 'num_splits' equal parts
# ...</p>
      <p>return df_list</p>
      <p>For instance, you can split the main DataFrame into four parts, apply functions to these
subDataFrames inparallel, and later reassemblethem toensuretheintegrityof the data.</p>
      <p>Next, we implement thefollowingfunctions:
● datetime_proc: Converts the 'time' column into datetime format for more efficient
timebasedanalysis.</p>
      <p>● create_time_resolutions: Creates features that capture information about the time
dimension, suchas hour, day, and week of the year.</p>
      <p>● rename_columns:Renames columns after groupingtofacilitateeasier mergingandaccess.
● create_grouped_df and create_grouped_df_shifted: These functions create
groupedfeatures basedonspecific criteria, allowingfor more detailedanalysis andanomalydetection.</p>
      <p>By using parallel processing, we can significantly reduce the time it takes to transform and prepare
your data, allowingyou to experiment withvarious anomalydetectiontechniques more efficiently.</p>
      <p>Next, let's examine the process of creating datetime features, comparing sequential and parallel
approaches:
with timer('datetime processing:'):</p>
      <p>df_ = create_time_resolutions(df_)
Datetimeprocessing: done in19.341 seconds.
with timer('pool datetime processing:'):
with Pool(processes=4) as pool:</p>
      <p>pool.map(create_time_resolutions, dfs_split)
with timer('pool datetime processing threads:'):
with ThreadPool(processes=4) as pool:</p>
      <p>pool.map(create_time_resolutions, dfs_split)</p>
      <sec id="sec-6-1">
        <title>Pool datetime processing: donein9.120 seconds. Pool datetime processingthreads: done in19.125 seconds.</title>
        <p>with timer('joblib parallel datetime processing:'):</p>
        <p>Parallel(n_jobs=4)(delayed(create_time_resolutions)(i) for i
in dfs_split)
with timer('joblib parallel datetime processing threads:'):</p>
        <p>Parallel(n_jobs=4,prefer='threads')</p>
      </sec>
      <sec id="sec-6-2">
        <title>Joblib parallel datetime processing: done in16.912 seconds. Joblib parallel datetime processingthreads: done in19.213 seconds. Final comparisoncanbeseenonTable2. 185</title>
        <sec id="sec-6-2-1">
          <title>Processing approach</title>
        </sec>
      </sec>
      <sec id="sec-6-3">
        <title>Regular (sequential) processing</title>
      </sec>
      <sec id="sec-6-4">
        <title>Pool parallel processing</title>
      </sec>
      <sec id="sec-6-5">
        <title>Pool parallel processing threads</title>
      </sec>
      <sec id="sec-6-6">
        <title>Joblib parallel processing</title>
      </sec>
      <sec id="sec-6-7">
        <title>Joblib parallel processing threads</title>
        <sec id="sec-6-7-1">
          <title>Result</title>
          <p>19.341 seconds
9.120 seconds
19.125 seconds
16.912 seconds
19.213 seconds</p>
          <p>Therefore, parallel processing is typically faster than sequential processing. However, it's worth
noting an interesting observation here: when using threads as the multiprocessing backend, parallel
processing can be noticeably slower. In some cases, it might even be slower than sequential processing.
In theory, both joblib and pool should yield similar results. It appears that kernel variability plays a
significant role. For example, in one run, multiprocessing was much quicker (9 seconds vs. 19 seconds),
while in the next run they showed almost identical results (19 seconds vs. 19 seconds).</p>
          <p>Now, let's delve into parallel feature engineering. In this scenario, we'll be creating several groupings
that will act as inputs for the pandas.groupby function. The exciting part is that these groupings will be
processed concurrently, demonstrating the power of parallel processing.</p>
          <p>with timer('grouping features:'):</p>
          <p>for i in groupings:
dfs_proc.append(create_grouped_df(df_time,
columns_set=market_cols_agg_num))
group_by=i,
with timer('grouping features shifted:'):
for i in groupings:</p>
          <p>dfs_proc_shift.append(create_grouped_df_shifted(df_time,
group_by=i, columns_set=market_cols_agg_num))</p>
        </sec>
      </sec>
      <sec id="sec-6-8">
        <title>Grouping features: done in 1.721 seconds. Grouping features shifted: done in 1.981 seconds. For multiprocessing we depict the example of code for pool grouping features parallel. For other 3 methods the code is identical, only with different parameters.</title>
        <p>with timer('pool grouping features parallel:'):
with Pool(processes=4) as pool:</p>
        <p>pool.starmap(#datasets go here as params)</p>
        <p>Pool grouping features parallel: done in 2.820 seconds. Pool grouping features parallel threads: done
in 1.101 seconds. Pool grouping features shifted parallel: done in 2.521 seconds. Pool grouping features
shifted parallel threads: done in 1.226 seconds. Joblib grouping features parallel threads: done in 1.009
seconds. Joblib grouping features shifted parallel threads: done in 1.182 seconds (Table 3).</p>
        <p>The disparities between parallel and sequential processing seem less pronounced for these
groupings. This could be attributed to the inherent efficiency of the grouping operation, making the
performance improvements less evident. It's advisable to evaluate the performance of the parallel
version before incorporating it into parallel processing methods. An intriguing observation is that, in
this context, the use of threads as the backend appears to be notably faster for multiprocessing.
Also, we can notice the advantage of ThreadPool over Pool, which comes from their nature:
● Lower Overhead: The key distinction between ThreadPool and Pool is the way they manage
parallel execution. Pool uses separate processes, while ThreadPool uses threads within a single process.
Creating and managing processes is generally more resource-intensive and time-consuming than
threads. Threads have lower overhead, as they share memory space and other resources, making them
quicker to start and manage.</p>
        <p>● Shared Memory: Threads within the same process share the same memory space, which can
lead to more efficient data sharing and communication. In contrast, processes have their own memory
space and need to use inter-process communication mechanisms to share data. This shared memory in
ThreadPool can be advantageous for tasks that require a lot of data sharing.</p>
        <p>● GIL (Global Interpreter Lock): In the context of Python, the Global Interpreter Lock (GIL) is
a mutex that allows only one thread to execute in the Python interpreter at a time. This means that even
in a multi-threaded environment, only one thread can execute Python bytecode at a time. While this
might seem like a limitation, it's not as significant when you're dealing with tasks that are I/O-bound
(e.g., file I/O, network requests), as the GIL is often released during I/O operations. In such cases, using
threads with ThreadPool can be more efficient.</p>
        <p>● Task Nature: The choice between ThreadPool and Pool can also depend on the nature of the
tasks you're parallelizing. If you have CPU-bound tasks (tasks that require a lot of computation), then
using separate processes (Pool) may provide better performance because of true parallelism on
multicore CPUs. On the other hand, for I/O-bound tasks or tasks that are heavily reliant on efficient data
sharing, threads (ThreadPool) might be faster due to the reasons mentioned above.</p>
        <p>Overall, we can sum up that parallel processing can benefit the training speed in both cases, but the
bump in the speed depends on the input size (the bigger size - the better speed improvement). Generally
speaking, we can see 2x speed boost with allocating 4x resources.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>5. Distributed machine learning</title>
      <p>
        Distributed training is the practice of training machine learning algorithms across multiple machines,
with the primary objective of enhancing scalability. In essence, it enables the handling of larger datasets
by integrating more machines into the training infrastructure [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>The benefits of distributed training are evident in the increased availability of CPUs and greater
bandwidth, facilitating the processing of substantial volumes of data. Nonetheless, harnessing this
added capacity effectively remains a significant challenge.
Distributed training becomes a necessity under the following circumstances:
1. Time-Intensive Training: When the training process consumes a substantial amount of time,
especially in scenarios with vast datasets such as large text corpora, extensive image datasets, video
data for autonomous vehicles, medical imaging data, or satellite imagery.</p>
      <p>2. Storage Constraints: When storing the data on a single machine is unfeasible due to its size.
3. Data Localization: When data accessibility is confined to specific locations, often due to
security requirements or the immense size of datasets (common in large enterprises or astro data).</p>
      <p>
        4. RAM Constraints: When the data must be entirely accommodated in RAM for preprocessing,
making it necessary to employ distributed training for efficient handling [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Distributed training is a powerful approach but not without its complexities. Let's delve into the
specifics of data parallelism, a technique focusing on the distribution of data across multiple machines.
Note that we won't explore model parallelism in this context, as it deals with distributing machine
learning models across several machines [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>The process of data parallelism typically involves the following steps:
1. Model Initialization: Initiate the machine learning model on the primary server, which serves
as the central point for model management.</p>
      <p>2. Model Distribution: Workers, the machines responsible for computation, retrieve a copy of the
model from the primary server. Each worker works on a subset of the data.</p>
      <p>3. Gradient Calculation: Workers perform calculations on their respective batches of training
data, computing the gradients of the loss function with respect to the model's parameters.</p>
      <p>4. Gradient Communication: After calculating the gradients, workers transmit this information
back to the primary server, which serves as a hub for collecting and processing these updates.</p>
      <p>5. Model Update: The primary server receives the gradients from all workers and utilizes them to
update the model weights, implementing changes based on the collected gradient information.</p>
      <p>
        6. Iterative Process: The process repeats, with workers fetching the updated model, recalculating
gradients, and sending them for further model refinement. This iterative cycle continues until the
training process reaches the desired performance level [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>These steps collectively constitute the data parallelism approach for distributed training, allowing
for efficient model training on vast datasets with the distributed computational power of multiple
machines.</p>
      <p>Numerous methods and techniques are available to effectively manage gradient computations when
dealing with distributed data across multiple machines.</p>
      <p>
        Synchronous updates, in which the main server waits for each worker to complete their gradient
computations before executing a model update and redistributing it to all workers for the next iteration,
can be a somewhat slow process. This is primarily due to the process being bottlenecked by the slowest
worker's progress. To address this issue, several approaches can be employed [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]:
1. Backup Workers: This strategy involves having backup workers available. In this setup, the
main server ceases to wait as soon as a certain number, N, of workers have completed their assigned
tasks. While this approach accelerates the training process by reducing the wait time for slower workers,
it comes at the cost of increased gradient variance. Smaller batch sizes result from this approach,
potentially affecting the model's convergence speed. Nevertheless, the net effect is a more efficient
training process due to the faster execution of additional iterations compared to waiting for slower
workers.
      </p>
      <p>2. Flexible Rapid Reassignment (FlexRR): FlexRR is another technique that can be employed to
mitigate the slowdown caused by a worker lagging behind. In this approach, a slower worker can
delegate a portion of its tasks to faster workers. This redistribution of workload allows the slower
worker to catch up with the overall progress of the other workers, ultimately reducing the training time.</p>
      <p>
        These strategies are aimed at addressing the inherent challenges of synchronous updates in
distributed training, improving overall training efficiency, and mitigating delays caused by slower
participants in the process [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>Asynchronous updates operate on a different principle. In this approach, the main server updates
and serves the model to all workers each time a worker delivers a gradient update. However, this method
can introduce race conditions because updates are computed on a worker using a model state that may
already be obsolete. The feasibility of asynchronous updates depends on the extent to which the weights
updated by concurrent workers are distinct.</p>
      <p>
        Fortunately, this approach proves effective in scenarios involving very large models, where only a
small fraction of the model's weights are updated with each gradient update. In such cases, the
convergence achieved with asynchronous updates is comparable to that of synchronous updates. What
sets asynchronous updates apart is their higher computing efficiency, as they allow for concurrent
updates from workers, optimizing the use of computational resources [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>While synchronous updates might initially seem sluggish, their speed can be significantly improved
with some additional engineering efforts, enabling them to match the efficiency of asynchronous
updates. In contrast, asynchronous updates are simpler to implement but hinge on the sparsity of weight
gradients.</p>
      <p>Successful distributed training revolves around meticulous parameter tuning. Once you've settled on
your model update strategy, fine-tuning the worker's batch size and the learning rate is of paramount
importance. This process differs from the gradient descent algorithm on a single machine. In the case
of synchronous updates, the model remains unaltered until all batches are collected from the workers.
This effectively creates a virtual batch size (B) calculated as B = N x b, where N represents the number
of workers and b signifies the batch size on a single machine. It's crucial for b to be sufficiently large
to compute substantial gradients, and N can also be substantial, especially when tackling large-scale
problems.</p>
      <p>This reduces the number of steps required in an epoch without significantly accelerating the loss
reduction process. To counteract this, you can use a higher learning rate, but this might come at the
expense of convergence stability. Thus, meticulous tuning becomes crucial to fully harness the
advantages of parallelization.</p>
      <p>In the context of selecting a gradient update strategy, the next pivotal step is the choice of a
framework to facilitate the integration of parallelization techniques within your machine learning
algorithm. For the purpose of this discussion, we will focus on the utilization of the Horovod
framework.</p>
      <p>Horovod stands as a distributed deep learning framework that is compatible with popular libraries
such as PyTorch and Keras. It boasts adaptability to various infrastructure backends, including
Kubernetes, Ray, and Spark.</p>
      <p>Our analysis will revolve around an illustrative example drawn from the Horovod repository,
demonstrating MNIST training.</p>
      <p>In the Horovod framework, several essential functions should be employed to enable distributed
model training:
● hvd.DistributedOptimizer: This distributed optimizer encapsulates a Torch
optimizer, offering a synchronized method for gathering and subsequently reducing gradients.</p>
      <p>Ultimately, it facilitates model updates.
● hvd.broadcast_parameters: This function serves to broadcast parameters (including
state_dict, named_parameters, and parameters) to ensure uniform model updates across worker
processes.
● hvd.broadcast_optimizer_state: It extends the capability to broadcast the optimizer
itself to the nodes, guaranteeing that gradient updates are computed with the correct learning
rate.
● hvd.allreduce: This function is integral for performing averaging or summation across all
processes running on the worker nodes. It plays a critical role in the DistributedOptimizer and
should be invoked whenever a value necessitates computation on worker processes and
subsequent reduction on the primary server.
● hvd.rank: This function provides a unique identifier for the process running on the worker,
facilitating differentiation between various worker processes.
● hvd.size: It offers insight into the total number of Horovod processes distributed across all
worker nodes.</p>
      <p>A crucial aspect of the data distribution process is managed by the PyTorch DistributedSampler:
● torch.utils.data.distributed.DistributedSampler: This component
undertakes the task of sampling a subset of the data, specifically associated with the process
rank.</p>
      <p>With these tools at our disposal, we are well-equipped to construct a distributed training workflow
for a PyTorch script. Furthermore, we may explore other distributed training approaches tailored to
specific machine learning frameworks, such as PyTorch's distributed data parallel training with
nn.parallel.DistributedDataParallel or Tensorflow/Keras' tf.distribute framework.</p>
      <p>Horovod offers synthetic benchmarks for TensorFlow v1, TensorFlow v2, and PyTorch, facilitating
performance and scalability evaluation in your specific environment. These benchmarks serve as a
valuable tool for assessing Horovod's capabilities and diagnosing performance issues. It's recommended
to run synthetic benchmarks initially to isolate potential problems not related to the training script itself.</p>
    </sec>
    <sec id="sec-8">
      <title>6. Conclusion</title>
      <p>In the world of machine learning, distributed training stands as a powerful approach for enhancing
scalability, allowing machine learning algorithms to effectively process massive datasets across
multiple machines. The benefits of distributed training are particularly evident in scenarios involving
extensive time-consuming training processes, storage constraints, data localization requirements, and
RAM limitations. With the increased availability of computational resources and greater bandwidth,
distributed training has emerged as a crucial solution to address these challenges.</p>
      <p>Data parallelism facilitates the collaborative training of machine learning models by distributing
data across multiple machines. This approach includes various key steps such as model initialization,
model distribution, gradient calculation, gradient communication, and model updates. Data parallelism
enables efficient model training on extensive datasets, harnessing the collective computational power
of multiple machines. Our experiments have consistently shown that parallel processing significantly
enhances the speed and efficiency of data pipelines, datetime feature creation, and feature engineering.
This speed improvement is most pronounced when dealing with computationally intensive operations.
Our results have demonstrated that parallelism offers a promising avenue for addressing the scalability
and efficiency challenges in modern data-driven applications.</p>
      <p>While distributed training offers immense benefits, it also brings inherent complexities. Strategies
for handling synchronous and asynchronous updates, such as the use of backup workers and Flexible
Rapid Reassignment (FlexRR), play a critical role in mitigating delays caused by slower workers and
optimizing training efficiency. Successful distributed training requires meticulous parameter tuning,
involving adjustments to the worker's batch size and learning rate, to fully harness the advantages of
parallelization. Parallel processing can benefit the training speed in both cases, but the bump in the
speed depends on the input size (the bigger size - the better speed improvement). Generally speaking,
we can see 2x speed boost with allocating 4x resources.</p>
      <p>Choosing a gradient update strategy is a pivotal step in the distributed training process. The selection
of an appropriate framework can significantly impact the integration of parallelization techniques into
your machine learning algorithm. The Horovod framework, compatible with popular libraries like
PyTorch and Keras, offers essential functions for distributed model training, such as
DistributedOptimizer, broadcast_parameters, broadcast_optimizer_state, allreduce, rank, and size.
These functions streamline the process of data distribution, enabling a seamless distributed training
workflow for machine learning tasks. This technique can assist to spread (parallelize) the learning
process across multiple machines. These techniques are essential for achieving rapid data processing,
model training, and feature engineering, enabling organizations to develop more effective anomaly
detection solutions and machine learning models in shorter timeframes.</p>
    </sec>
    <sec id="sec-9">
      <title>7. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Tal</given-names>
            <surname>Ben-Nun</surname>
          </string-name>
          and
          <string-name>
            <given-names>Torsten</given-names>
            <surname>Hoefler</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis</article-title>
          .
          <source>ACM Comput. Surv</source>
          .
          <volume>52</volume>
          ,
          <issue>4</issue>
          ,
          <string-name>
            <surname>Article 65</surname>
          </string-name>
          (
          <year>July 2020</year>
          ),
          <volume>43</volume>
          pages. https://doi.org/10.1145/3320060
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Aach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Inanc</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sarma</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          et al.
          <article-title>Large scale performance analysis of distributed deep learning frameworks for convolutional neural networks</article-title>
          .
          <source>J Big Data</source>
          <volume>10</volume>
          ,
          <issue>96</issue>
          (
          <year>2023</year>
          ). https://doi.org/10.1186/s40537-023-00765-w
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Esmaeilzadeh</surname>
            ,
            <given-names>Soheil</given-names>
          </string-name>
          &amp; Salajegheh, Negin &amp; Ziai, Amir &amp; Boote,
          <string-name>
            <surname>Jeff.</surname>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>Abuse and Fraud Detection in Streaming Services Using Heuristic-Aware Machine Learning</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Hegde</surname>
            , Vishakh and
            <given-names>Sheema</given-names>
          </string-name>
          <string-name>
            <surname>Usmani</surname>
          </string-name>
          .
          <article-title>“Parallel and Distributed Deep Learning</article-title>
          .” (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>William</given-names>
            <surname>Fedus</surname>
          </string-name>
          , Barret Zoph, &amp;
          <string-name>
            <surname>Noam</surname>
            <given-names>Shazeer.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Low</surname>
          </string-name>
          ,
          <string-name>
            <surname>Yucheng</surname>
          </string-name>
          , et al.
          <article-title>"Graphlab: A new framework for parallel machine learning</article-title>
          .
          <source>" arXiv preprint arXiv:1408</source>
          .
          <year>2041</year>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Jianmin</given-names>
            <surname>Chen</surname>
          </string-name>
          , Xinghao Pan, Rajat Monga, Samy Bengio, &amp;
          <string-name>
            <surname>Rafal</surname>
            <given-names>Jozefowicz.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Revisiting Distributed Synchronous SGD</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Wang</surname>
            , Hao,
            <given-names>Di</given-names>
          </string-name>
          <string-name>
            <surname>Niu</surname>
            , and
            <given-names>Baochun</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>"Distributed machine learning with a serverless architecture."</article-title>
          <source>IEEE INFOCOM 2019-IEEE Conference on Computer Communications. IEEE</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Mu.</given-names>
          </string-name>
          <article-title>"Scaling distributed machine learning with system and algorithm co-design." Santa Clara</article-title>
          , CA, USA: Intel (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Peteiro-Barral</surname>
          </string-name>
          , Diego, and
          <string-name>
            <surname>Bertha</surname>
          </string-name>
          Guijarro-Berdiñas.
          <article-title>"A survey of methods for distributed machine learning</article-title>
          .
          <source>" Progress in Artificial Intelligence</source>
          <volume>2</volume>
          (
          <year>2013</year>
          ):
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Konečný</surname>
          </string-name>
          ,
          <string-name>
            <surname>Jakub</surname>
          </string-name>
          , et al.
          <article-title>"Federated optimization: Distributed machine learning for on-device intelligence</article-title>
          .
          <source>" arXiv preprint arXiv:1610.02527</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Joost</surname>
            <given-names>Verbraeken</given-names>
          </string-name>
          , Matthĳs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, &amp;
          <string-name>
            <surname>Jan S. Rellermeyer</surname>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>A Survey on Distributed Machine Learning</article-title>
          .
          <source>ACM Computing Surveys</source>
          ,
          <volume>53</volume>
          (
          <issue>2</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Xing</surname>
            ,
            <given-names>Eric P.</given-names>
          </string-name>
          , et al.
          <article-title>"Strategies and principles of distributed machine learning on big data</article-title>
          .
          <source>" Engineering 2</source>
          .2 (
          <year>2016</year>
          ):
          <fpage>179</fpage>
          -
          <lpage>195</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Kraska</surname>
          </string-name>
          ,
          <string-name>
            <surname>Tim</surname>
          </string-name>
          , et al.
          <article-title>"MLbase: A Distributed Machine-learning</article-title>
          <source>System." Cidr</source>
          . Vol.
          <volume>1</volume>
          .
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <surname>Jeffrey</surname>
          </string-name>
          , et al.
          <article-title>"Large scale distributed deep networks</article-title>
          .
          <source>" Advances in neural information processing systems</source>
          <volume>25</volume>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Almasi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Gottlieb</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>1989</year>
          ).
          <article-title>Highly Parallel Computing</article-title>
          .
          <article-title>Benjamin-Cummings Publishing Co</article-title>
          ., Inc..
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Farkas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kertész</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Lovas</surname>
          </string-name>
          ,
          <article-title>"Parallel and Distributed Training of Deep Neural Networks: A brief overview,"</article-title>
          <source>2020 IEEE 24th International Conference on Intelligent Engineering Systems (INES)</source>
          , Reykjavík, Iceland,
          <year>2020</year>
          , pp.
          <fpage>165</fpage>
          -
          <lpage>170</lpage>
          , doi: 10.1109/INES49302.
          <year>2020</year>
          .
          <volume>9147123</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Chan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Stolfo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>1993</year>
          ).
          <article-title>Toward Parallel and Distributed Learning by Meta-Learning</article-title>
          .
          <source>In Proceedings of the 2nd International Conference on Knowledge Discovery in Databases</source>
          (pp.
          <fpage>227</fpage>
          -
          <lpage>240</lpage>
          ). AAAI Press.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Bekkerman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bilenko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Langford</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Scaling up Machine Learning: Parallel and Distributed Approaches</article-title>
          .
          <source>In Proceedings of the 17th ACM SIGKDD International Conference Tutorials. Association for Computing Machinery.</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Lazarevic</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Obradovic</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <article-title>Boosting Algorithms for Parallel and Distributed Learning</article-title>
          .
          <source>Distributed and Parallel Databases</source>
          <volume>11</volume>
          ,
          <fpage>203</fpage>
          -
          <lpage>229</lpage>
          (
          <year>2002</year>
          ). https://doi.org/10.1023/A:1013992203485
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>