<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Journal of Image</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/ISCC53001.2021.9631440</article-id>
      <title-group>
        <article-title>Victoria Vysotska1, †, Iryna Kyrychenko2, †, Vadym Demchuk3, † and Nadiia Babkova4, †</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>Stepan Bandera street 12, Lviv, 79013</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Technical University «Kharkiv Polytechnic Institute»</institution>
          ,
          <addr-line>2, Kyrpychova str., Kharkiv, 61002</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Shift-Left Architecture, Big Data Streaming, Early Issue Detection</institution>
          ,
          <addr-line>Real-Time Monitoring, Processing Optimization, Performance Tuning, Parameter Tuning, Machine Learning, Real-time Configuration Adaptation, Stream Processing, Automatic Parameter Tuning</addr-line>
          ,
          <institution>System Optimization, Infrastructure Cost Savings</institution>
          ,
          <addr-line>Data Pipeline Management 1</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>1</volume>
      <fpage>25</fpage>
      <lpage>36</lpage>
      <abstract>
        <p>This research article explores the integration of a Competency module within the Shift-Left Architecture framework tailored for Big Data streaming systems. This study demonstrates that the proposed module significantly enhances maintainability, enables early issue detection, and improves risk assessment. Leveraging automation, real-time monitoring, and proactive validation, the Competency module streamlines configuration management, optimises streaming pipelines and accelerates processing efficiency. Key findings reveal its capability to validate source configurations proactively, reconfigure processing engines, and propose enhancements, reducing manual intervention and minimising downtime. This research contributes a robust framework that strengthens the efficiency, reliability, and scalability of Big Data streaming architectures, offering valuable insights for implementing Shift-Left principles effectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The evolution of data processing architectures can be categorised into distinct generations, each
addressing the limitations of its predecessor.</p>
      <p>
        The Extract-Transform-Load (ETL) model's initial stage entails pulling raw data from on-premises
databases, processing it with limited storage and computational power, and storing the refined data
in a data warehouse. Although this method was suitable for its era, it faces notable drawbacks, such
as limited processing power, inadequate scalability, and challenges in efficiently storing and
analysing past data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        In response to these issues, a second-generation architecture emerged, defined by the
Extract-Load-Transform (ELT) model. This method allows raw data to be quickly fed into a scalable,
cost-efficient Data Lake, utilising a cloud-based, multi-tier medallion framework. Data transformation is
carried out with highly parallelised, cloud-optimised computing resources, offering exceptional
scalability and performance. The resulting data is channelled to downstream business applications,
supporting more robust analytics and profound insights [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Fig. 1 represents the diagram of an ETL and ELT data pipeline.</p>
      <p>
        Despite its advantages, the ELT paradigm has drawbacks that can hinder efficiency
and data quality [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]:




      </p>
      <p>Processing data in batches often leads to delays and inconsistencies. In micro-batch
processing, extra steps are needed to align and integrate the data with its current state,
adding complexity to the workflow.</p>
      <p>Different teams — such as AI, Data Platform, and Marketing — frequently create independent
pipelines for the same data sources to support their respective systems. This redundancy
wastes resources and drives up operational costs.</p>
      <p>Without comprehensive documentation, "similar-but-slightly-different" pipelines multiply,
causing inefficiencies and making maintenance more challenging.</p>
      <p>Reverse ETL, widely used in contemporary setups like data warehouses, Delta Lakes, or Data
Lakehouses, adds further redundancy and data duplication, amplifying processing
inefficiencies.</p>
      <p>Data has become the most valuable asset in today's business landscape. To thrive in a highly
competitive environment, organisations require enriched, trustworthy, and contextualised data
delivered in near real-time to drive decision-making and innovation.</p>
      <p>
        The Shift-Left Architecture was developed as an innovative solution to tackle these challenges. In
the context of Big Data streaming, Shift-Left Architecture is a design pattern that repositions data
processing and governance nearer to the point of data origin, drawing inspiration from Shift-Left
Testing in software engineering. This approach prioritises real-time data processing as it is
generated, leveraging tools like Apache Kafka and Apache Flink to create immediate data products
for Data Warehouses (e.g., Snowflake), Data Lakes, and Data Lakehouses (e.g., Databricks). By
focusing on enriching, transforming, and constructing data products accurately at the source,
Shift-Left adheres to the "build once, use many times" philosophy. It ensures that data products are
efficiently prepared and seamlessly reused downstream, minimising redundancy while enhancing
data quality across the enterprise [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        The Shift-Left Architecture (see Fig. 2) reimagines Big Data workflows by shifting data processing
and governance closer to the source. Early data cleaning, aggregation, and enrichment guarantee
that downstream systems — whether analytical platforms like Data Lakes, Data Warehouses, and
Data Lakehouses or operational systems like microservices—receive well-structured, high-quality
data. It reduces repetitive processing, shortens time to value, and maximises business impact from
the outset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        A standout feature of Shift-Left is its alignment with the data mesh paradigm, enabled by
real-time data products. Integrating transactional and analytical workloads using technologies such as
Apache Kafka, Apache Flink, and Apache Iceberg delivers consistent, high-quality data across the
organisation. Streaming data can be processed instantly or fed into modern analytics platforms like
Snowflake, Databricks, or Google BigQuery, creating a cohesive foundation for AI and analytics
initiatives [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
      <p>Building on these strengths, Shift-Left Architecture emphasises the critical role of early issue
detection and risk management. By spotting potential problems at the data source, organisations can
address risks before they cascade into downstream systems, avoiding expensive rework and
operational setbacks. A specialised Competency Module could be embedded within the Shift-Left
framework to strengthen this proactive stance, focusing on issue resolution and risk evaluation. This
module would arm data engineers and analysts with the expertise and tools to detect, assess, and fix
data quality issues, schema discrepancies, and security risks at the source. Embedding this capability
early in the data lifecycle would bolster the resilience and effectiveness of the data ecosystem,
amplifying the transformative impact of the Shift-Left approach.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>A significant body of prior research has explored the ETL and ELT methodologies. These
investigations examined ETL and ELT workflows, identifying obstacles to handling vast amounts of
real-time data, with efforts aimed at enhancing efficiency, minimising delays, and guaranteeing
dependability — critical factors for industries like finance, healthcare, manufacturing, and
telecommunications.</p>
      <p>
        One of the well-known approaches in big data is ETL (Extract-Transform-Load), which has
been a standard for some time. Nishanth Reddy Mandala's research, "ETL in Data Lakes vs. Data
Warehouses" [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], examines the benefits of the ETL method for Data Lake and Data Warehouse
architectures. Alongside these strengths, it identifies significant weaknesses, including latency, delayed data
availability, rigid schemas, lack of flexibility, and high maintenance and scalability costs.
      </p>
      <p>
        The ELT (Extract-Load-Transform) approach was introduced to address the ETL challenges. The
book "Delta Lake - Deep Dive" [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], written by Nikhil Gupta &amp; Jason Yip, uncovers crucial aspects of
the Lakehouse paradigm, considering the ELT approach for data pipelines. While ELT overcomes the
challenges of ETL, this multi-hop architecture still has weaknesses:
• Delayed Updates – the longer the data pipeline and the more tools involved, the more time
it takes to refresh the data product.
• Extended Time-to-Market – development work is duplicated because each business unit
must repeat identical or similar processing tasks instead of leveraging a centralised, curated
data product.
• Higher Expenses – analytics platforms thrive financially on compute usage rather than on
storage. The increased reliance of business units on tools like DBT boosts profits for analytics
SaaS providers.
• Redundant Work – many organisations utilise multiple analytics systems — such as various
Data Warehouses, Data Lakes, and AI platforms — resulting in repeated processing efforts
with ELT across these environments.
• Inconsistent Data – integration methods like Reverse ETL and Zero ETL, among others,
result in discrepancies between analytical and operational applications. Connecting a
real-time consumer or mobile app API to a batch-processing layer won't deliver uniform
outcomes.
      </p>
      <p>
        Another work, "A reference architecture for serverless big data processing" [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], addresses the
challenges posed by existing ETL/ELT architectures. It emphasises the importance of reducing
time-to-market for data products by leveraging serverless platforms due to their high scalability.
      </p>
      <p>
        Confluent Platform introduces Shift-Left architecture for Big Data to enhance data processing, as
this approach was initially applied to testing. The article "Shift Left: Headless Data Architecture" [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
considers the key advantages of the approach:
• Enhanced Data Processing Efficiency: Shift-Left reduces the need for extensive data transfers
by processing data closer to the source, thereby decreasing both the time and costs associated
with large-scale data movement.
• Real-Time Over Batch Processing: By enabling real-time or near-real-time data handling
rather than batch methods, Shift-Left supports applications that rely on current data, such
as predictive analytics, machine learning systems, and operational decision-making.
• Better Data Quality: Processing data closer to its origin enables early detection and
correction of issues, preventing quality problems from spreading downstream and ensuring
reliable, high-quality data.
• Integrated Workloads: The Shift-Left approach connects transactional (operational) and
analytical tasks, facilitating seamless real-time data sharing across applications for use cases
such as real-time inventory tracking and personalised customer interactions.
• Faster Innovation and Expansion: This architecture speeds up the delivery of data-driven
applications to the market, enabling businesses to deploy data products more quickly [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ].
      </p>
      <p>
        The next work, "Shift Left. Unifying Operations and Analytics With Data Products" by Adam
Bellemare [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], explores how the headless data architecture, termed "Shift-Left," bridges the gap
between operational and analytical data. The text evaluates existing data architectures — Delta Lake,
Data Warehouse, and Data Lakehouse — highlighting the challenges in transferring data from
operational to analytical systems. It identifies the key limitations of these architectures:
• Downstream consumers bear full responsibility for the ETL/ELT process. However, without
ownership of the source, systems must ensure that data remains relevant, available, and
consistent.
• These architectures are expensive, requiring significant data copying and processing power.
• They create redundant processing due to cross-team communication problems, and outages
or inconsistencies affect all downstream systems.
• They require data reconciliation, that is, restoring data quality after different quality gates
have been applied following denormalising, restructuring, and enriching.
• Curated data is not reusable for operational workloads, as the analytical workloads are
optimised only for OLAP (online analytical processing).
      </p>
      <p>The author highlights another aspect: handling bad data is the weakest part of the
Shift-Left architecture. Because the architecture relies on immutable data streams, where data is
append-only and cannot be altered once written, handling corrupted data poses a significant
challenge. Traditional ETL/ELT processes can employ a "stop-the-world" approach to
remove, correct, and reprocess bad data so that downstream systems receive consistent
records; streaming architectures lack this flexibility. Corrupted data in streams can lead to severe
consequences, including financial losses and irreversible business decisions. To mitigate these risks,
Bellemare proposes several strategies for managing and preventing bad data in streams:
• Prevention Strategy: Emphasising rigorous design, thorough testing, and robust validation
rules to ensure data integrity from the outset.
• Issue Correction Event Design: Establishing mechanisms to notify downstream services of
specific data updates, facilitating corrective actions without altering the original stream.
• Rewind, Rebuild, and Retry Strategy: When other methods fail, recreate data streams with
correct data as a last resort.</p>
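      <p>To illustrate the issue-correction event design, the following minimal Python sketch replays an append-only stream in which a later event corrects an earlier one instead of overwriting it. The event fields (event_id, corrects, payload) and the function materialise are hypothetical names chosen for illustration; they are not taken from Bellemare's text or from any Kafka API.</p>
      <preformat><![CDATA[
```python
from typing import Dict, List, Optional, TypedDict


class Event(TypedDict):
    event_id: str
    corrects: Optional[str]  # id of the event this one supersedes, if any
    payload: dict


def materialise(stream: List[Event]) -> Dict[str, dict]:
    """Build the consumer-side view of an append-only stream.

    The stream itself is never modified; a correction event only changes
    what the consumer believes about the event it corrects.
    """
    view: Dict[str, dict] = {}
    for event in stream:
        # A correction is stored under the id of the event it supersedes.
        target = event["corrects"] or event["event_id"]
        view[target] = event["payload"]
    return view


stream: List[Event] = [
    {"event_id": "e1", "corrects": None, "payload": {"amount": 100}},
    {"event_id": "e2", "corrects": None, "payload": {"amount": -5}},  # bad record
    {"event_id": "e3", "corrects": "e2", "payload": {"amount": 5}},   # correction
]

view = materialise(stream)
```
]]></preformat>
      <p>In this sketch the bad record stays in the stream, so the history remains auditable, while every consumer that replays the stream arrives at the corrected value.</p>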
      <p>Despite these strategies, the Shift-Left architecture remains highly sensitive to bad data.
While Bellemare underscores the importance of prevention, the proposed solutions do not entirely
address challenges arising from workload fluctuations or provide proactive measures to anticipate
and prevent potential incidents. A stronger focus on predictive analytics and adaptive workload
management could further enhance the architecture's resilience.</p>
      <p>Since the approach eliminates multiple stages for processing data and moves data extraction,
sanity checks, and enrichment closer to the source, incorrect data schemas and improper streaming
or processing settings can lead to several problems in the early processing stages, including
performance degradation and data loss. In the next section, we propose a new approach to extend
the existing Shift-Left Architecture with a module that will identify and mitigate potential issues
nearer to the source.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Competency Module of Shift-Left Architecture</title>
      <p>
        In Big Data, shift-left architecture refers to processing and governing data closer to its source rather
than moving it to a centralised location for processing, such as in traditional Extract-Transform-Load
(ETL) pipelines. This approach aligns with modern data strategies prioritising real-time processing,
cost efficiency, and data quality. It is particularly relevant in environments leveraging data streaming
technologies like Apache Kafka and Flink and platforms like Databricks and Snowflake, which
support decentralised processing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        The strengths of Shift-Left Architecture are [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]:
• Data processing efficiency. By processing data near its source, Shift-Left reduces the need
for large-scale data movement, which is often costly and time-consuming.
• Real-time processing instead of batch. Shift-Left enables real-time or near-real-time data processing,
which is critical for applications requiring up-to-date information, such as predictive
analytics, machine learning models, and operational decision-making.
• Improved data quality. Issues can be identified and corrected closer to the source, which
prevents data quality problems from propagating downstream and ensures that only high-quality,
trustworthy data is used.
• Unified workloads. Shift-Left architecture facilitates the unification of transactional
(operational) and analytical workloads. It allows for consistent real-time data sharing across
applications, enabling scenarios like real-time inventory management and personalised
customer experiences.
• Innovation and growth. The architecture is designed to speed up the time to market for
data-driven applications, ensuring that data products reach the business more quickly.
      </p>
      <p>
        While it has significant strengths, the following weaknesses have been identified through recent
analyses [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]:
• Complexity in implementation. Transitioning to Shift-Left Architecture often requires
significant changes to existing data pipelines and architectures. It can be complex and
resource-intensive, especially for organisations with legacy systems. Even when a new pipeline
is designed, some complex logic can be implemented more safely without streaming, as
streaming pipelines pose multiple challenges.
• Dependency on advanced tools. Effective implementation relies on modern tools like data
streaming platforms (e.g., Apache Kafka, Flink) and cloud-native technologies. Hence, the
designed applications are unlikely to be cloud-agnostic.
• Balancing speed and quality. While Shift-Left emphasises real-time processing, ensuring data
quality remains a challenge. Organisations must embed robust quality checks at the source
to avoid compromising accuracy in favour of speed.
      </p>
      <p>Thus, the new architecture brings the following challenges:
• Source dependence. Since data is processed near its origin, the source system's quality,
format, and configuration are critical. If the source is poorly configured or ineffective (e.g.,
producing incomplete or erroneous data), it can directly impact the downstream processes.
• Configuration issues. Misconfigurations at the source and processing engine, such as
incorrect data schemas and improper streaming or processing settings, can lead to problems
in the early processing stages. These issues may not be easily mitigated since centralised
validation or transformation is less likely in modern architectures.
• Source effectiveness. If the source system is slow or
unreliable, it can bottleneck the entire pipeline, reducing the benefits of real-time processing.</p>
      <p>This section introduces a solution that addresses the existing challenges within Shift-Left
Architecture.</p>
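      <p>The source-dependence and configuration challenges above can be made concrete with a small validation step placed next to the source. The following Python sketch is only an illustration under stated assumptions: the schema, the field names, and the functions validate and route are hypothetical, and the dead-letter list stands in for whatever dead-letter mechanism a real pipeline would use.</p>
      <preformat><![CDATA[
```python
from typing import Any, Dict, List, Tuple

# Hypothetical expected schema: field name -> required Python type.
EXPECTED_SCHEMA: Dict[str, type] = {"sensor_id": str, "ts": int, "value": float}

Record = Dict[str, Any]


def validate(record: Record) -> List[str]:
    """Return the schema violations of a record; an empty list means valid."""
    errors: List[str] = []
    for field, expected in EXPECTED_SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return errors


def route(records: List[Record]) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Split records into valid ones and dead letters with their error lists."""
    valid: List[Record] = []
    dead_letters: List[Tuple[Record, List[str]]] = []
    for record in records:
        errors = validate(record)
        if errors:
            dead_letters.append((record, errors))  # kept for later inspection
        else:
            valid.append(record)
    return valid, dead_letters


records = [
    {"sensor_id": "a", "ts": 1, "value": 2.5},
    {"sensor_id": "b", "ts": "late", "value": 1.0},  # wrong type for ts
    {"sensor_id": "c", "ts": 3},                     # value is missing
]
valid, dead_letters = route(records)
```
]]></preformat>
      <p>Only the first record reaches the valid list; the other two are kept together with the reason for rejection, so a misconfigured source is noticed before its data spreads downstream.</p>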
      <sec id="sec-3-1">
        <title>3.1. Methodology of the Competency Module</title>
        <p>
          The module incorporates the Holistic Adaptive Optimisation Technique (HAOT), a method
engineered to overcome the shortcomings of conventional parameter tuning approaches for Delta
Lake [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. HAOT offers a thorough and flexible optimisation framework that employs machine
learning to persistently evaluate and adjust the configurations of sources, streaming engines, and
sinks dynamically in real-time. In this article, we apply the HAOT technique within Shift-Left
Architecture.
        </p>
        <p>HAOT functions by leveraging real-time performance feedback and making adaptive
configuration adjustments. The method encompasses the following essential steps:
• Ongoing data collection that tracks performance metrics of the streaming application (such
as throughput, latency, and resource usage) while also monitoring the configurations of
sources, streaming engines, and sinks.
• Employing machine learning algorithms to evaluate the gathered data and uncover
connections between the configurations of sources, streaming engines, and sinks and their
effects on application performance, including building a predictive model to assess the
performance outcomes of different configurations.
• Using insights from this relational analysis to dynamically tweak the configurations of
sources, streaming engines, and sinks for enhanced performance, guided by machine
learning models that forecast optimal settings based on current conditions.
• Continuously refining the machine learning models with fresh performance data as the
streaming application operates, enabling the system to adjust to evolving data trends and
operational environments, thereby maintaining effective and relevant optimisation over
time.</p>
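        <p>The four steps above form a feedback loop, which the following Python sketch outlines. It is not the HAOT implementation: a running mean of observed throughput per configuration stands in for the machine learning model, selection is purely greedy, and the names RunningMeanModel, tuning_loop, and measure, as well as the sample throughput figures, are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, Tuple


class RunningMeanModel:
    """Stand-in predictor: mean observed throughput per configuration."""

    def __init__(self) -> None:
        self._sum: Dict[Hashable, float] = defaultdict(float)
        self._count: Dict[Hashable, int] = defaultdict(int)

    def update(self, config: Hashable, throughput: float) -> None:
        self._sum[config] += throughput
        self._count[config] += 1

    def predict(self, config: Hashable) -> float:
        # Unseen configurations score +inf so each one is tried at least once.
        if self._count[config] == 0:
            return float("inf")
        return self._sum[config] / self._count[config]


def tuning_loop(
    candidates: Sequence[Hashable],
    measure: Callable[[Hashable], float],
    rounds: int,
) -> Tuple[Hashable, List[Hashable]]:
    """Run the collect / model / adjust / refine cycle for a number of rounds."""
    model = RunningMeanModel()
    history: List[Hashable] = []
    for _ in range(rounds):
        config = max(candidates, key=model.predict)  # adjust: pick predicted best
        throughput = measure(config)                 # collect: observe a metric
        model.update(config, throughput)             # model and refine
        history.append(config)
    return max(candidates, key=model.predict), history


# Hypothetical throughput (messages per second) for four partition counts.
observed = {1: 100.0, 2: 180.0, 4: 250.0, 8: 220.0}
best, history = tuning_loop([1, 2, 4, 8], lambda p: observed[p], rounds=6)
```
]]></preformat>
        <p>With these sample figures the loop tries each candidate once and then settles on four partitions. A learned model such as the one described in this section would replace the running mean, and a real loop would also need to keep exploring as conditions change.</p>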
        <p>We added a Competency module to the existing architecture to address the challenges mentioned
at the beginning of this section. This module offers the key benefits mentioned in Table 1.</p>
        <sec id="sec-3-1-1">
          <title>Key benefits of the Competency module</title>
          <table-wrap>
            <label>Table 1</label>
            <table>
              <tbody>
                <tr>
                  <td/>
                  <td>Automated configuration checks, performance analytics, and best practices enforcement</td>
                </tr>
                <tr>
                  <td>Proactively validates and monitors in real time.</td>
                  <td>Real-time metrics tracking (e.g., Prometheus), dead-letter queues, and early anomaly detection</td>
                </tr>
                <tr>
                  <td>Implements data quality checks to identify risks early</td>
                  <td>Schema validation, versioning strategies, and continuous monitoring for bottlenecks</td>
                </tr>
              </tbody>
            </table>
          </table-wrap>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Implementation Details of the Competency Module</title>
        <p>
          The new design brings the following components to the existing Shift-Left architecture (Fig. 3):
• Streaming sources represent the starting points of data streams, encompassing diverse
real-time data producers like IoT devices, social media updates, log files, or sensors. These sources
generate a steady flow of data fed into the streaming pipeline, which delivers data into the
system in real-time or near real-time.
• After collecting data from streaming sources, a streaming processing cluster manages it. This
cluster comprises distributed computing resources designed to handle large data volumes in
real-time. It performs various functions, including filtering, aggregating, transforming, and
analysing the streaming data. Operating continuously, the cluster frequently employs
parallel processing and fault-tolerant techniques to manage high data throughput and ensure
data accuracy. It is built to scale effectively to accommodate fluctuating volumes of incoming
data streams.
• Data Collection Service is tasked with retrieving and structuring data from diverse sources
for optimisation purposes, utilising connectors. The data may encompass performance
metrics, system logs, environmental factors, and other relevant information, depending on
the specific streaming pipeline. The Data Collection Service ensures that data is gathered
consistently, efficiently, and securely while preprocessing it (through actions like cleaning,
normalising, and transforming) to prepare it for analysis and modelling. This service is
vital, as the quality and applicability of the data directly influence the optimisation process's success.
• The ML Model for Parameter Tuning is the central feature of the proposed HAOT
implementation. The model is engineered to analyse collected data, uncovering patterns,
correlations, and trends that might be missed during human observation. It can leverage a
range of algorithms (such as regression, classification, clustering, or deep learning)
tailored to the specific challenge. Trained on historical data, the model predicts outcomes,
enhances processes, or delivers insights. As additional data is gathered and conditions evolve,
the model can be retrained or fine-tuned, allowing it to adapt to emerging patterns and boost
its precision and utility over time. For this service, we will implement a Long Short-Term
Memory (LSTM) model for parameter tuning due to the time-series nature of streaming data
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
• ML Model for Risk Assessment is a new component of the HAOT implementation proposed
by the Competency Module framework. This model analyses collected metrics, identifying
potential risks, anomalies, and vulnerabilities that may not be readily noticeable to human
analysts. It utilises various algorithms (such as anomaly detection, classification,
time-series forecasting, and deep learning) customised to tackle specific risk-related challenges
in streaming pipelines. The model will be trained on historical data that includes
performance metrics, error logs, and configuration states, and it will predict risk
probabilities, flag potential issues, and provide actionable mitigation insights [
          <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
          ]. The
implementation of this module is beyond the scope of this article.
• The ML Model for Issue Detection is also a new component for HAOT implementation
provided by the Competency module, which is designed to evaluate and enhance data
pipelines within the Shift-Left Architecture for Big Data streaming. This model
incrementally analyses pipeline performance data to pinpoint inefficiencies, bottlenecks, and
areas for improvement that might otherwise go unnoticed. It employs diverse algorithms
(such as clustering, anomaly detection, regression, or reinforcement learning) tailored to
detect issues and recommend optimisations across the pipeline [
          <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
          ]. Like the
previous component, this model will be trained on historical and real-time data; it identifies
suboptimal configurations, suggests alternative source setups (e.g., adjusting Kafka
partitions), proposes different processing engines (e.g., switching from Flink to Spark
Streaming), and even recommends architectural adjustments (e.g., adding redundancy).
Implementing this model is also beyond the scope of this article.
• The Control Module functions as the central decision-maker in this framework. It leverages
the insights and suggestions provided by the ML model to make choices or modifications to
the system or process under optimisation. This includes tweaking parameters or refining
strategies as needed. The module is engineered to execute these adjustments in a deliberate,
trackable, and reversible way, facilitating ongoing monitoring and fine-tuning based on
performance feedback and shifting conditions. It acts as the bridge connecting the analytical
outputs of the ML model with the system's operational elements, ensuring seamless and
effective optimisation. By maintaining a structured approach, the Control Module guarantees
that changes enhance efficiency while preserving system stability. It is pivotal in translating
data-driven recommendations into practical, impactful actions [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
• The Control Centre is a critical component of the proposed Competency Module within the
HAOT implementation, designed explicitly for Shift-Left Architecture in Big Data streaming.
This web-based interface serves as the centralised hub where users can access and analyse
the outputs of the ML Models for Risk Assessment and Issue Detection. It presents actionable
insights in an intuitive, visual form, including identified risks (e.g., potential
bottlenecks or data quality issues) and detected inefficiencies (e.g., suboptimal source
configurations or processing engine recommendations). Designed for real-time interaction,
the Control Centre enables users to monitor pipeline health, review suggested optimisations,
and assess risk probabilities through dashboards, charts, and alerts. It helps users make
informed decisions by consolidating complex analytical results into clear, actionable
recommendations, bridging the gap between machine learning outputs and operational
responses. The interface also accommodates user feedback, allowing adjustments to be
flagged for the Control Module to implement, ensuring seamless integration with the broader
HAOT framework. Ultimately, the Control Centre enhances visibility and control, aligning
with the Shift-Left paradigm's emphasis on proactive management and early intervention in
streaming pipelines.
        </p>
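        <p>The requirement that the Control Module applies adjustments in a trackable and reversible way can be sketched as a change log with rollback. The following Python example is a minimal illustration only: the class name, its methods, and the sample values are hypothetical, and the configuration keys are used purely as dictionary keys rather than being sent to a broker.</p>
        <preformat><![CDATA[
```python
from typing import Any, Dict, List, Tuple

_MISSING = object()  # marks "key was absent before the change"


class ControlModule:
    """Applies configuration changes and logs them so each can be undone."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config: Dict[str, Any] = dict(config)
        self._log: List[Tuple[str, Any, Any]] = []  # (key, old value, new value)

    def apply(self, key: str, new_value: Any) -> None:
        """Record the previous value, then apply the new one."""
        old_value = self.config.get(key, _MISSING)
        self._log.append((key, old_value, new_value))
        self.config[key] = new_value

    def rollback(self) -> bool:
        """Undo the most recent change; return False if nothing is left to undo."""
        if not self._log:
            return False
        key, old_value, _ = self._log.pop()
        if old_value is _MISSING:
            del self.config[key]
        else:
            self.config[key] = old_value
        return True


initial = {"retention.ms": 604800000}
control = ControlModule(initial)
control.apply("retention.ms", 86400000)      # hypothetical recommendation
control.apply("cleanup.policy", "compact")   # hypothetical recommendation
control.rollback()                           # undo the last change only
```
]]></preformat>
        <p>After the single rollback the shorter retention is still in effect while the cleanup policy change has been withdrawn; a second rollback would restore the initial configuration.</p>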
        <p>The proposed Competency Module within the HAOT framework for Shift-Left Architecture in
Big Data streaming integrates a comprehensive set of components to enhance pipeline optimisation.
The Data Collection Service ensures high-quality, preprocessed data as the foundation for analysis,
directly impacting the effectiveness of subsequent processes. The ML Model for Parameter Tuning,
using an LSTM approach, facilitates dynamic configuration optimisation by uncovering patterns in
time-series data. Meanwhile, the ML Models for Risk Assessment and Issue Detection improve the
framework by proactively identifying risks and inefficiencies while offering tailored
recommendations for enhancement. The Control Module connects these analytical insights to
actionable outcomes, adjusting in a controlled and adaptive manner. Together, these components
create a cohesive system that boosts maintainability, encourages early issue detection, and fortifies
risk management, paving the way for efficient and reliable streaming pipelines. Although the
implementation details of the Risk Assessment and Issue Detection models exceed the scope of this
article, their conceptual integration emphasises the framework's potential for comprehensive
optimisation.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <p>This article examines the proposed Competency Module, which implements HAOT specifically for
Shift-Left architecture. The key components of the experiment include the ML Model for Parameter
Tuning, the Data Collection Service, and the Control Module.</p>
      <sec id="sec-4-1">
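        <p>
          To evaluate the approach, we apply the following parameter-tuning statement. It can be defined
as finding the optimal options as the argument of the maximum (argmax) of a performance function [
          <xref ref-type="bibr" rid="ref17">17, 18</xref>
          ]. Given the
pipeline job <inline-formula><tex-math>J</tex-math></inline-formula> that processes an input data stream <inline-formula><tex-math>D</tex-math></inline-formula> over cluster resources <inline-formula><tex-math>R</tex-math></inline-formula> and loads data to the
sink <inline-formula><tex-math>S</tex-math></inline-formula>, the parameter tuning can be considered as evaluating an optimal configuration
<inline-formula><tex-math>c^{*}</tex-math></inline-formula> that
maximises the performance metric function <inline-formula><tex-math>F</tex-math></inline-formula> over the configuration space <inline-formula><tex-math>C</tex-math></inline-formula>:
          <disp-formula><label>(1)</label><tex-math>c^{*} = \underset{c \in C}{\operatorname{arg\,max}}\; F(J, D, R, S, c)</tex-math></disp-formula>
        </p>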
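        <p>The argmax statement in Equation (1) can be sketched as an exhaustive search over a small discrete configuration space. This is only an illustration under stated assumptions: the job, stream, resources, and sink are treated as fixed, and toy_metric is a hypothetical stand-in for the performance function, not the metric used in the experiment.</p>
        <preformat><![CDATA[
```python
from itertools import product
from typing import Callable, Dict, Iterable, List

Config = Dict[str, int]


def enumerate_space(options: Dict[str, List[int]]) -> Iterable[Config]:
    """Yield every configuration in a small discrete configuration space."""
    names = list(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


def argmax_config(options: Dict[str, List[int]],
                  metric: Callable[[Config], float]) -> Config:
    """Return the configuration with the highest metric value (exhaustive search)."""
    return max(enumerate_space(options), key=metric)


def toy_metric(config: Config) -> float:
    """Hypothetical performance function: rewards batching, penalises added delay."""
    throughput_gain = config["max.poll.records"] / 100.0
    delay_penalty = config["linger.ms"] / 10.0
    return throughput_gain - delay_penalty


# Hypothetical candidate values for two of the tuned parameters.
space = {"linger.ms": [0, 5, 20], "max.poll.records": [100, 500, 1000]}
best = argmax_config(space, toy_metric)
print(best)
```
]]></preformat>
        <p>Exhaustive search is feasible only for tiny spaces such as this one; the number of candidates grows multiplicatively with each added parameter, which is one reason a learned model is attractive for larger spaces.</p>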
        <p>To assess the implementation, we chose data streaming pipelines designed using the Shift-Left
approach with the following technology stack:</p>
        <list list-type="bullet">
          <list-item><p>Sources: Confluent Kafka.</p></list-item>
          <list-item><p>Processing Engine: Kafka Streams on a Kubernetes Cluster.</p></list-item>
          <list-item><p>Sink: Confluent Kafka.</p></list-item>
        </list>
        <p>These pipelines operated for one month, generating comprehensive logs and metrics that enabled
us to select parameters using the machine learning model. Given the wide range of parameters these
technologies involve, training the model with every possible variable is currently unfeasible.
However, the main goal of this study is to assess whether this method is viable. The parameters
selected for the experiment are outlined in Table 1.</p>
      </sec>
      <sec id="sec-4-2">
        <table-wrap>
          <table>
            <thead>
              <tr><th>Parameter</th><th>Description</th></tr>
            </thead>
            <tbody>
              <tr><td>Topic partitions</td><td>Defines the number of partitions of a Kafka topic. Ideally, this parameter would be fine-tuned by the Control Module, but we bypass this step for now since it involves a challenging task [19].</td></tr>
              <tr><td>cleanUpPolicy</td><td>Determines how log segments are managed for a topic. The two main options are "delete", which removes old log segments after a retention period, and "compact", which retains the latest value for each key by removing older duplicates, ensuring efficient storage use [19].</td></tr>
              <tr><td>Topic retention period</td><td>The duration or size limit for which messages are stored in a topic's log before becoming eligible for deletion or compaction, as defined by the cleanup policy [19].</td></tr>
              <tr><td>min.insync.replicas</td><td>Specifies the minimum number of in-sync replicas that must acknowledge a write for it to succeed. In Kafka Streams, it affects the durability of data written to output and internal topics, ensuring consistency based on the underlying Kafka topic configuration [19].</td></tr>
              <tr><td>linger.ms</td><td>Sets the maximum time (in milliseconds) the internal producer buffers data before sending it to output topics, balancing latency and throughput. The value is zero by default, meaning records are sent immediately [20].</td></tr>
              <tr><td>compression.type</td><td>Specifies the compression codec for data written to output topics. It controls the internal producer's compression, with possible values none, gzip, snappy, lz4, or zstd. The default value is none (no compression) [20].</td></tr>
              <tr><td>fetch.max.bytes</td><td>Sets the maximum data size (in bytes) the internal consumer fetches from input topics in a single request. The default is 50 MB [20].</td></tr>
              <tr><td>max.poll.records</td><td>Sets the maximum number of records the internal consumer returns from input topics in a single poll() call. The Kafka consumer default is 500, which Kafka Streams overrides to 1000 [20].</td></tr>
              <tr><td>request.timeout.ms</td><td>Sets the maximum time (in milliseconds) for the internal producer or consumer to wait for a response from the broker before timing out. The default is 30 seconds [20].</td></tr>
              <tr><td>max.poll.interval.ms</td><td>Sets the maximum time (in milliseconds) allowed between poll() calls before the internal consumer is considered failed. The default value is 5 minutes [20].</td></tr>
              <tr><td>minReplicas</td><td>Specifies the minimum number of pod replicas running for a given workload. It ensures the application maintains at least this many instances, even under low load, for availability and resilience [21].</td></tr>
              <tr><td>maxReplicas</td><td>Specifies the maximum number of pod replicas created for a workload. It sets an upper limit on scaling to prevent excessive resource usage under high load [21].</td></tr>
            </tbody>
          </table>
        </table-wrap>
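        <p>One way to hold the parameters described above in code is a small validated container that renders the Kafka client settings as string properties. This is a sketch only: the class name, the sanity checks, and the replica counts are assumptions for illustration, while the property names are the ones listed in the parameter descriptions.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import Dict

ALLOWED_CODECS = {"none", "gzip", "snappy", "lz4", "zstd"}


@dataclass(frozen=True)
class TunableConfig:
    """Hypothetical container for the tuned client and autoscaling options."""
    linger_ms: int
    compression_type: str
    fetch_max_bytes: int
    max_poll_records: int
    request_timeout_ms: int
    max_poll_interval_ms: int
    min_replicas: int
    max_replicas: int

    def validate(self) -> None:
        """Run simple sanity checks before a configuration is applied."""
        if self.compression_type not in ALLOWED_CODECS:
            raise ValueError(f"unknown codec: {self.compression_type}")
        if self.min_replicas < 1 or self.min_replicas > self.max_replicas:
            raise ValueError("need 1 <= minReplicas <= maxReplicas")
        numeric = (self.linger_ms, self.fetch_max_bytes, self.max_poll_records,
                   self.request_timeout_ms, self.max_poll_interval_ms)
        if any(value < 0 for value in numeric):
            raise ValueError("client settings must be non-negative")

    def client_properties(self) -> Dict[str, str]:
        """Render the Kafka client settings as string properties."""
        return {
            "linger.ms": str(self.linger_ms),
            "compression.type": self.compression_type,
            "fetch.max.bytes": str(self.fetch_max_bytes),
            "max.poll.records": str(self.max_poll_records),
            "request.timeout.ms": str(self.request_timeout_ms),
            "max.poll.interval.ms": str(self.max_poll_interval_ms),
        }


# Example values; the replica counts are hypothetical.
config = TunableConfig(linger_ms=0, compression_type="none",
                       fetch_max_bytes=52_428_800, max_poll_records=500,
                       request_timeout_ms=30_000, max_poll_interval_ms=300_000,
                       min_replicas=1, max_replicas=4)
config.validate()
props = config.client_properties()
print(props)
```
]]></preformat>
        <p>Keeping the autoscaling fields (minReplicas, maxReplicas) out of client_properties reflects that they belong to the Kubernetes workload rather than to the Kafka clients.</p>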
        <p>In the experiment, the following options were kept unchanged. The Kafka topic had 12 partitions (as
this parameter is not yet flexible in the Kafka architecture), and cleanUpPolicy was Delete (as changing
this field requires recreating the topic). The ML model set the other options: the minimum and maximum number
of pod replicas (minReplicas, maxReplicas), the topic retention period, the compression codec for records
(compression.type), the maximum number of bytes fetched from the source in one request (fetch.max.bytes), the maximum
number of records an individual consumer can pull from Kafka in one poll (max.poll.records), the timeout for
a response from the broker (for consumer and producer, request.timeout.ms), the maximum interval between polls
of a topic (max.poll.interval.ms), and the maximum time for buffering data before sending it
downstream (linger.ms).</p>
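        <p>The split between fixed and model-controlled options could be enforced when recommendations are applied. The sketch below is an assumption about how such a guard might look, not the Control Module's implementation; the function name and the sample values are hypothetical.</p>
        <preformat><![CDATA[
```python
from typing import Dict, Union

Value = Union[int, str]

# Options held constant in the experiment, as described above.
FIXED_OPTIONS: Dict[str, Value] = {"partitions": 12, "cleanUpPolicy": "delete"}


def apply_recommendation(current: Dict[str, Value],
                         recommended: Dict[str, Value]) -> Dict[str, Value]:
    """Merge model recommendations into the current settings, leaving fixed options untouched."""
    merged = dict(current)
    for name, value in recommended.items():
        if name in FIXED_OPTIONS:
            continue  # ignore any attempt to change a fixed option
        merged[name] = value
    merged.update(FIXED_OPTIONS)  # fixed options always win
    return merged


# Hypothetical current settings and model output.
current = {"partitions": 12, "cleanUpPolicy": "delete",
           "linger.ms": 0, "max.poll.records": 500}
recommended = {"partitions": 24, "linger.ms": 20, "compression.type": "lz4"}
result = apply_recommendation(current, recommended)
print(result)
```
]]></preformat>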
        <p>The input dataset had the characteristics mentioned in Table 3.</p>
        <p>The Kubernetes cluster had 16 CPU cores and 64 GB of memory in total. The performance metrics
were taken from the Confluent Control Centre, and application logs were collected with Loki and
Grafana. Table 4 lists the metrics and where each was collected [22, 23, 24].</p>
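        <p>The cluster size and the partition count together bound how many pod replicas can be useful. The following back-of-the-envelope sketch assumes a single input topic, so that Kafka Streams has at most one stream task per partition, and uses hypothetical per-pod resource requests (2 CPU cores, 4 GB); neither assumption comes from the experiment, and system overhead on the cluster is ignored.</p>
        <preformat><![CDATA[
```python
def max_useful_replicas(partitions: int, threads_per_pod: int,
                        cluster_cpu: int, cluster_mem_gb: int,
                        pod_cpu: int, pod_mem_gb: int) -> int:
    """Upper bound on pod replicas that can be scheduled and still receive work."""
    # With one task per input partition, threads beyond the partition count stay idle.
    by_partitions = -(-partitions // threads_per_pod)  # ceiling division
    # The cluster can only host as many pods as its CPU and memory allow.
    by_cpu = cluster_cpu // pod_cpu
    by_memory = cluster_mem_gb // pod_mem_gb
    return max(1, min(by_partitions, by_cpu, by_memory))


# 12 partitions, 16 cores and 64 GB come from the text; pod sizes are hypothetical.
single_threaded = max_useful_replicas(12, 1, 16, 64, pod_cpu=2, pod_mem_gb=4)
four_threads = max_useful_replicas(12, 4, 16, 64, pod_cpu=2, pod_mem_gb=4)
print(single_threaded, four_threads)
```
]]></preformat>
        <p>Under these assumptions the CPU budget is the binding limit for single-threaded pods, while the partition count becomes the limit once each pod runs several stream threads.</p>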
      </sec>
      <sec id="sec-4-3">
        <table-wrap>
          <table>
            <thead>
              <tr><th>Source</th><th>Destination</th></tr>
            </thead>
            <tbody>
              <tr><td>Application Pod</td><td>Extracted from Kubernetes, collected in Grafana</td></tr>
              <tr><td>Application Pod</td><td>Extracted from Kubernetes, collected in Grafana</td></tr>
              <tr><td>Application Pod</td><td>Extracted from Kubernetes, collected in Grafana</td></tr>
              <tr><td>Kafka</td><td>Dashboard in Confluent Control Centre</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The experiment setup:</p>
        <list list-type="bullet">
          <list-item><p>Base. The first part of the experiment involved running the pipeline with metrics collection
and parameter tuning activated solely for Kafka Streams. It served as the baseline for
performance metrics, where only the internal components of Kafka Streams were
optimised based on available data without external influences from other pipeline
components.</p></list-item>
          <list-item><p>HAOT Applied. In the second part, the experiment extended metrics collection and
parameter tuning to cover both the processing engine (Kafka Streams) and the Kafka parameters of
the Source and Target (mentioned in Table 2). This approach represents the holistic application
of HAOT, where the optimisation technique is applied across the entire data pipeline rather
than in isolated segments.</p></list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-results">
      <title>5. Results</title>
      <p>The runs for the first experiment used default values for the configurations (Table 5).</p>
      <p>The average metric values of the two runs ("Base" and "HAOT Applied") are presented
in Table 7.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Discussion</title>
      <p>In the experiment, CPU utilisation increased, indicating that compute resources were used
more effectively and that the system manages workloads more efficiently under the modified
configuration. Memory utilisation rose by
14%, a change due to adjustments in specific Kafka client parameters: linger.ms (producer), fetch.max.bytes,
and max.poll.records (consumer). The increase in linger.ms likely allowed the producer to buffer more data
before sending, optimising throughput at the expense of higher memory usage. Similarly, raising
fetch.max.bytes enabled the consumer to retrieve larger data batches per fetch request, while
increasing max.poll.records allowed more records to be processed per poll, collectively contributing
to greater memory demand.</p>
      <p>A significant outcome was the 97% reduction in Kafka lag per second, driven by an enhanced
record-pulling mechanism. This improvement resulted from tuning fetch.max.bytes and
max.poll.records, which enabled the consumer to retrieve more data in fewer, larger
requests, thereby reducing the frequency of polls and minimising lag. Additionally, adjustments to
request.timeout.ms and max.poll.interval.ms played a crucial role in mitigating network-related
issues. By extending the timeout thresholds, these parameters offered greater resilience against
delays caused by an overloaded Kafka broker, ensuring the consumer could wait longer for responses
without failing, thus stabilising performance under high load.</p>
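      <p>The percentages quoted in this section are relative changes between the "Base" and "HAOT Applied" runs. A minimal sketch of that arithmetic follows; the input readings are placeholders chosen only to reproduce the reported 97% and 14% figures and are not measurements from the experiment.</p>
      <preformat><![CDATA[
```python
def percent_change(baseline: float, tuned: float) -> float:
    """Relative change of the tuned run against the baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (tuned - baseline) / baseline * 100.0


# Placeholder readings, not values from the experiment's result tables.
lag_change = round(percent_change(1000.0, 30.0), 1)    # Kafka lag per second
memory_change = round(percent_change(50.0, 57.0), 1)   # memory utilisation
print(lag_change, memory_change)
```
]]></preformat>
      <p>A negative result denotes a reduction, so the lag figure corresponds to a 97% decrease while the memory figure corresponds to a 14% increase.</p>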
      <p>The ML model guiding the experiment recommended higher values for several parameters, although
these values were capped: linger.ms, fetch.max.bytes, request.timeout.ms, max.poll.interval.ms, and
maxReplicas (from Kubernetes). For instance, a higher linger.ms improved batching efficiency, while
increasing fetch.max.bytes and max.poll.interval.ms optimised data retrieval and processing
intervals. These elevated values significantly boosted overall system performance, as evidenced
by reduced lag and better resource utilisation, but they also introduced a trade-off: the processing delay
for individual records increased. This added latency arose because larger batches (via linger.ms and
fetch.max.bytes) and extended timeouts (via request.timeout.ms and max.poll.interval.ms) prioritised
throughput over per-record responsiveness, so a single record could wait
longer in the pipeline. In the Kubernetes context, setting a higher maxReplicas in the
HorizontalPodAutoscaler (HPA) enabled the system to scale to more pod instances under peak
demand, enhancing throughput and fault tolerance.</p>
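      <p>Limiting the model's recommendations can be sketched as clamping each proposed value to an allowed range before it is applied. The ranges below are hypothetical, since the article does not state the limits that were used, and the function name is likewise an assumption for this example.</p>
      <preformat><![CDATA[
```python
from typing import Dict, Tuple

# Hypothetical (low, high) caps; the actual limits are not given in the article.
LIMITS: Dict[str, Tuple[int, int]] = {
    "linger.ms": (0, 100),
    "fetch.max.bytes": (1_048_576, 104_857_600),
    "request.timeout.ms": (1_000, 120_000),
    "max.poll.interval.ms": (10_000, 900_000),
    "maxReplicas": (1, 12),
}


def clamp_recommendation(recommended: Dict[str, int]) -> Dict[str, int]:
    """Clamp each recommended value to its allowed range; pass others through."""
    clamped = {}
    for name, value in recommended.items():
        if name in LIMITS:
            low, high = LIMITS[name]
            value = max(low, min(high, value))
        clamped[name] = value
    return clamped


# Hypothetical model output that exceeds several caps.
proposal = {"linger.ms": 250, "fetch.max.bytes": 209_715_200,
            "maxReplicas": 20, "max.poll.records": 2000}
applied = clamp_recommendation(proposal)
print(applied)
```
]]></preformat>
      <p>Capping in this way trades some potential throughput for a bound on per-record delay and resource usage, which matches the trade-off discussed above.</p>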
      <p>The experiment markedly improved Kafka consumer efficiency and system scalability, with CPU
and memory resources better utilised and Kafka lag nearly eliminated. However, the configuration's
focus on batch optimisation and network resilience, combined with increased replica scaling,
suggests a design favouring high-throughput workloads over low-latency, single-record processing.</p>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusions</title>
      <p>In summary, integrating a Competency module into the Shift Left architecture has significantly
enhanced service performance and stability, underscoring its value as a crucial improvement. This
enhancement enables earlier issue detection and improves the system's overall reliability. However,
challenges remain, particularly in achieving scalability, managing the complexity of machine
learning models, and maintaining an appropriate balance between optimisation frequency and
system stability. Looking ahead, efforts will focus on refining this approach to tackle these
challenges, enhancing the efficiency of the machine learning algorithms, and extending its
application to a broader range of streaming platforms and use cases.
Furthermore, we will examine the second part of the Competency module, Risk Assessment and
Issue Detection, to strengthen the framework. These developments are expected to further solidify
the Competency module's advantages within the Shift Left architecture.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <sec id="sec-7-1">
        <p>The authors have not employed any Generative AI tools.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>P.</given-names> <surname>Strengholt</surname></string-name>,
          <source>Data Management at Scale: Modern Data Architecture with Data Mesh and Data Fabric</source>,
          <edition>2nd</edition> ed., <publisher-name>O'Reilly Media, Inc.</publisher-name>,
          <year>2023</year>, pp. <fpage>173</fpage>-<lpage>175</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>H.</given-names> <surname>Dulay</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Mooney</surname></string-name>,
          <source>Streaming Data Mesh: A Model for Optimising Real-Time Data Services</source>,
          <edition>1st</edition> ed., <publisher-name>O'Reilly Media, Inc.</publisher-name>,
          <year>2023</year>, pp. <fpage>36</fpage>-<lpage>39</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>Gaurav Ashok</given-names> <surname>Thalpati</surname></string-name>,
          <source>Practical Lakehouse Architecture: Designing and Implementing Modern Data Platform at Scale</source>,
          <publisher-name>O'Reilly Media, Inc.</publisher-name>,
          <year>2024</year>, pp. <fpage>372</fpage>-<lpage>381</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <collab>Confluent</collab>,
          <source>What is Shift Left?</source>,
          <year>2025</year>. URL: <uri>https://www.confluent.io/learn/what-is-shift-left</uri>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>Kai</given-names> <surname>Waehner</surname></string-name>,
          <source>The Shift Left Architecture - From Batch and Lakehouse to Data Streaming</source>,
          <year>2024</year>. URL: <uri>https://kai-waehner.medium.com/the-shift-left-architecture-from-batch-and-lakehouseto-data-streaming-d1ea7306ea30</uri>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>Nishanth Reddy</given-names> <surname>Mandala</surname></string-name>,
          <article-title>ETL in Data Lakes vs. Data Warehouses</article-title>,
          v. <volume>1</volume> of <source>ESP Journal of Engineering &amp; Technology Advancements</source>,
          <year>2021</year>. doi: <pub-id pub-id-type="doi">10.56472/25832646/JETA-V1I2P123</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>N.</given-names> <surname>Gupta</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Yip</surname></string-name>,
          <chapter-title>Delta Lake - Deep Dive</chapter-title>,
          <source>Databricks Data Intelligence Platform</source>,
          <publisher-name>Apress</publisher-name>, <publisher-loc>Berkeley, CA</publisher-loc>,
          <year>2024</year>, pp. <fpage>61</fpage>-<lpage>88</lpage>.
          doi: <pub-id pub-id-type="doi">10.1007/979-8-8688-0444-1_4</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>S.</given-names> <surname>Werner</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Tai</surname></string-name>,
          <article-title>A reference architecture for serverless big data processing</article-title>,
          volume <volume>155</volume> of <source>Future Generation Computer Systems</source>,
          <year>2024</year>. doi: <pub-id pub-id-type="doi">10.1016/j.future.2024.01.029</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <collab>Confluent</collab>,
          <source>Shift Left: Headless Data Architecture</source>,
          <year>2024</year>. URL: <uri>https://www.confluent.io/blog/shift-left-headless-data-architecture-part-2</uri>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <source>An Overview of Shift Left Architecture</source>,
          <year>2025</year>. URL: <uri>https://www.deltastream.io/shift-leftarchitecture-an-overview/</uri>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>A.</given-names> <surname>Bellemare</surname></string-name>,
          <source>Shift Left Unifying Operations and Analytics With Data Products</source>,
          <year>2024</year>. URL: <uri>www.confluent.io/resources/ebook/unifying-operations-analytics-with-data-products</uri>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>Navdeep Singh</given-names> <surname>Gill</surname></string-name>,
          <source>Mastering Shift Left Architecture for Real-Time Data Products</source>,
          <year>2025</year>. URL: <uri>https://www.xenonstack.com/blog/shift-left-architecture-data-products</uri>.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>V.</given-names> <surname>Vysotska</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Kyrychenko</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Demchuk</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Gruzdo</surname></string-name>,
          <article-title>Holistic Adaptive Optimization Techniques for Distributed Data Streaming Systems</article-title>,
          <source>CEUR Workshop Proceedings</source>, Vol-<volume>3668</volume>,
          <year>2024</year>, ISSN 1613-0073.
          doi: <pub-id pub-id-type="doi">10.31110/COLINS/2024-2/009</pub-id>.
          URL: <uri>https://ceur-ws.org/Vol-3668/paper9.pdf</uri>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>M.</given-names> <surname>Trotter</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Wood</surname></string-name>, and
          <string-name><given-names>J.</given-names> <surname>Hwang</surname></string-name>,
          <article-title>Forecasting a Storm: Divining Optimal Configurations using Genetic Algorithms and Supervised Learning</article-title>,
          <source>IEEE International Conference on Autonomic Computing (ICAC)</source>,
          <year>2019</year>. doi: <pub-id pub-id-type="doi">10.1109/ICAC.2019.00025</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Karunasekera</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Buyya</surname>
          </string-name>
          ,
          <article-title>Performance and Cost-Efficient Spark Job Scheduling Based on Deep Reinforcement Learning in Cloud Computing Environments</article-title>
          ,
          <source>IEEE Transactions on Parallel and Distributed Systems</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>I.</given-names> <surname>Kyrychenko</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Tereshchenko</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Proniuk</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Geseleva</surname></string-name>,
          <article-title>Predicate Clustering Method and its Application in the System of Artificial Intelligence</article-title>,
          <source>CEUR-WS</source>,
          <year>2023</year>, Vol. <volume>3396</volume>, pp. <fpage>395</fpage>-<lpage>406</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name><given-names>H.</given-names> <surname>Herodotou</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Chen</surname></string-name> and
          <string-name><given-names>J.</given-names> <surname>Lu</surname></string-name>,
          <article-title>A Survey on Automatic Parameter Tuning for Big Data Processing Systems</article-title>,
          <source>ACM Computing Surveys (CSUR)</source>, Vol. <volume>53</volume>,
          <year>2020</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>