<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Seed Scheduling in Fuzz Testing as a Markov Decision Process</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rafael F. Cunha</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Müller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Rooijakkers</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Puck de Haan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fatih Turkmen</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer Institute for Secure Information Technology SIT</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TNO</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Groningen</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
<p>Coverage-guided Greybox Fuzzing (CGF) is an effective method for discovering software vulnerabilities. Traditional fuzzers, such as American Fuzzy Lop (AFL), rely on heuristics for critical tasks like seed scheduling, which often lack adaptability and may not optimally balance exploration with exploitation. This paper presents a novel approach to enhance seed scheduling in CGF by formalizing it as a Markov Decision Process (MDP). We detail the design of this MDP, including the state representation derived from fuzzer and coverage data, the action space encompassing seed selection and power assignment, and a reward function geared towards maximizing coverage and bug discovery. A Proximal Policy Optimization (PPO) agent is then trained to learn a scheduling policy from this MDP within the AFL++ fuzzer. Our investigation into this Deep Reinforcement Learning (DRL) based approach reveals that while the MDP formulation provides a structured framework, practical application faces significant challenges, including high computational demands for training and intensive hyperparameter tuning. The key contributions of this work are: (1) a concrete MDP formulation for the complex task of fuzzer seed scheduling, (2) an analysis of the inherent difficulties and trade-offs in applying DRL to this specific domain, and (3) insights gained from the agent's learning process (or lack thereof), which inform the discussion on the suitability of DRL for this type of optimization problem in fuzzing. This research provides a foundational exploration of DRL for seed scheduling and highlights critical considerations for future advancements in intelligent fuzzing agents.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Coverage-guided Greybox Fuzzers (CGFs), exemplified by tools like American Fuzzy Lop (AFL) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], are
highly effective for uncovering software vulnerabilities. They operate by mutating a corpus of ‘seed’
inputs to generate new test-cases, retaining those that achieve new program coverage to guide further
exploration. While the general idea is straightforward, CGFs are often compute-intensive, and their
efficiency heavily relies on making intelligent decisions throughout the fuzzing process [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>A critical stage amenable to optimization is seed scheduling. This encompasses two interdependent
decisions: (i) selecting the next seed input from the corpus for mutation, and (ii) determining the amount
of fuzzing effort (or ‘energy’) to allocate to the chosen seed. This task inherently involves a trade-off
between exploration (testing diverse seeds to find new program areas) and exploitation (focusing on
seeds that have historically been productive). Current fuzzers predominantly employ heuristics for
these decisions, which, while often effective, may lack adaptability and can be suboptimal for specific
programs or evolving fuzzing campaign states.</p>
      <p>
        To address these limitations, this work investigates the application of Deep Reinforcement Learning
(DRL) to optimize seed scheduling in CGFs. While machine learning (ML) has increasingly found
applications in the fuzzing domain [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6">3, 4, 5, 6</xref>
        ], we specifically propose to formalize the seed scheduling
problem as a Markov Decision Process (MDP) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. An MDP provides a principled framework for
sequential decision-making, allowing an agent to learn a policy that maximizes a cumulative reward.
DRL, which combines Reinforcement Learning (RL) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] with artificial neural networks (ANNs), offers
the potential to learn complex policies in environments with large state and action spaces, such as those
encountered in fuzzing.
      </p>
      <p>This paper details our approach to modeling seed scheduling as an MDP and the subsequent attempt to
train a DRL agent, specifically using Proximal Policy Optimization (PPO), to learn an effective scheduling
policy. The primary contributions of this work are: First, we present a novel MDP formulation tailored
for CGF seed scheduling, elaborating on the design of the state space (derived from fuzzer telemetry and
coverage information), the action space (representing choices of seeds and their mutation energy), and
the reward function (aimed at encouraging new coverage and bug discovery). Second, we provide an
in-depth analysis of the significant challenges encountered when applying DRL to this specific fuzzing
problem. These challenges include the design of a Markovian state representation, the non-stationarity
of the fuzzing environment, the vast action space if selecting seeds directly, the practical difficulties of
online training within a live fuzzer, and the intensive hyperparameter tuning required for DRL agents.
Third, based on our experiences and the performance of the trained agent, we offer insights into the
suitability and current limitations of DRL for tackling the seed scheduling task in its current common
CGF implementations. This research contributes to the understanding of how DRL can be applied to
complex decision-making problems in software security and highlights key areas for future research in
developing more intelligent and adaptive fuzzing systems.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>This section provides essential background on Coverage-Guided Fuzzing (CGF) and Markov
Decision Processes (MDPs). It then reviews related applications of Reinforcement Learning in fuzzing to
contextualize the novelty of our approach to seed scheduling.</p>
      <sec id="sec-2-1">
        <title>2.1. Coverage-Guided Greybox Fuzzing (CGF)</title>
        <p>
          Fuzzing is a dynamic software testing technique renowned for its effectiveness in finding bugs and
vulnerabilities [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Coverage-guided greybox fuzzers, such as AFL++ [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], represent a particularly
successful class. CGFs iteratively mutate a set of input files, known as ‘seeds’, to generate new test-cases.
These test-cases are executed against the Program Under Test (PUT), and feedback, typically in the
form of code coverage (e.g., edge coverage in a Control-Flow Graph (CFG)), is used to guide the fuzzing
process. Inputs that discover new coverage are added to the seed corpus for further mutation, aiming to
progressively explore more program states [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The selection of which seed to fuzz next and for how
long (seed scheduling) is a critical heuristic that significantly impacts CGF performance. Our work
focuses specifically on modeling this seed scheduling aspect within a CGF like AFL++.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Markov Decision Processes (MDPs) and Reinforcement Learning (RL)</title>
        <p>
          Reinforcement Learning (RL) is a paradigm where an agent learns to make optimal sequential decisions
by interacting with an environment and receiving scalar reward signals [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Many RL problems are
formalized using a Markov Decision Process (MDP) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. An MDP is typically defined by a tuple
ℳ = ⟨𝒮, 𝒜, 𝒫, ℛ, γ⟩, where 𝒮 is the set of states, 𝒜 is the set of actions, 𝒫(s′ | s, a) is the state
transition probability function, ℛ(s, a, s′) is the reward function, and γ ∈ [0, 1] is a discount factor.
The agent’s goal is to learn a policy π(a | s) that maximizes the expected cumulative discounted reward.
Deep Reinforcement Learning (DRL) utilizes deep neural networks to scale RL to complex problems.
Our work employs DRL to learn a seed scheduling policy by modeling the process as an MDP, with a
key aspect being the definition of the program’s coverage map as the state s ∈ 𝒮.
        </p>
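        <p>To make this notation concrete, the following minimal sketch shows the generic agent-environment loop and the accumulation of the discounted return. It uses a stand-in Gymnasium environment and a random policy purely for illustration; it is not our fuzzing setup, which is described in Section 3.</p>
        <preformat>
import gymnasium as gym

env = gym.make("CartPole-v1")   # stand-in environment; in our case this is the fuzzer (Section 3)
gamma = 0.99                     # discount factor γ

state, _ = env.reset(seed=0)
discounted_return, discount = 0.0, 1.0
for t in range(200):
    action = env.action_space.sample()               # random policy π(a|s), for illustration only
    state, reward, terminated, truncated, _ = env.step(action)
    discounted_return += discount * reward            # accumulate Σ γ^t r_t
    discount *= gamma
    if terminated or truncated:
        break
print(discounted_return)
        </preformat>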
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Reinforcement Learning in Fuzzing</title>
        <p>
          The application of Reinforcement Learning (RL) to enhance fuzzing effectiveness is an active research
area [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Seed scheduling, in particular, significantly impacts fuzzer performance [
          <xref ref-type="bibr" rid="ref11 ref12 ref13 ref14 ref15">11, 12, 13, 14, 15</xref>
          ],
making it a prime candidate for RL-based optimization.
        </p>
        <p>
          Existing ML-based approaches to seed scheduling have predominantly utilized Multi-Armed Bandit
(MAB) models [16]. These include T-Scheduler [17] using Thompson sampling [18], AFL-HIER [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]
employing UCB1 [19], EcoFuzz [20] using adversarial bandits, LinFuzz [21] with contextual bandits
[22], and SLIME [23] utilizing UCB-V [24]. While effective, MABs typically optimize for immediate
rewards and do not inherently model the full state transitions of a fuzzing campaign.
        </p>
        <p>Deep Reinforcement Learning (DRL) has also been applied, primarily focusing on the mutation
strategy. For example, Böttinger et al. [25] utilized DRL to select mutation operators, and similar
DRL-based mutation optimization was explored by Zhang et al. [26] and Shao et al. [27]. These works
address a different aspect of fuzzing than seed scheduling.</p>
        <p>
          To our knowledge, Choi et al. [28] presented the only prior attempt to formalize seed scheduling
for CGF as an MDP and apply RL. Their approach modeled individual test-cases as states and used
Temporal-Difference TD(0) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] to learn state values, guiding seed selection and energy assignment.
However, their formulation presents several challenges regarding standard MDP assumptions. For
instance, the action space size (number of seeds in the queue) can vary over time, potentially violating
the requirement for a fixed action set or a consistent mapping for policy approximation. Furthermore,
their reward mechanism and state transition definition (where multiple new inputs from one seed are
considered next states) could lead to non-Markovian dynamics and difficulties in ensuring convergence
of value estimates.
        </p>
        <p>This paper builds upon the promise of RL for fuzzing but specifically addresses the seed scheduling
task by proposing a novel MDP formulation where the global coverage map constitutes the state. This
aims to provide a more holistic and Markovian representation of the fuzzing campaign’s progress
for DRL-based policy learning, distinct from prior MAB approaches and addressing the formulation
challenges noted in previous MDP attempts for scheduling.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Our Approach: MDP for Fuzzing Seed Selection</title>
      <p>
        To address the limitations of heuristic-based seed scheduling in CGFs, we propose a novel approach
based on DRL. This involves formalizing the seed scheduling task—encompassing both seed selection
and mutation energy assignment—as an MDP. Our core contribution lies in this formulation, particularly
the use of the global coverage map as the primary state representation. We implement this within
AFL++ [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and train a DRL agent using the Proximal Policy Optimization (PPO) algorithm [29] to learn
an optimal scheduling policy.
      </p>
      <p>Central to our MDP formulation is the definition of the agent and its environment. As depicted
in Figure 1, the DRL agent embodies the seed scheduler. Its role is to observe the current state of
the fuzzing campaign and decide which seed to fuzz next and with what energy. The environment
comprises all other components of the fuzzer (e.g., mutation engine, AFL++’s core loop) and the Program
Under Test (PUT). The agent learns by receiving (state, reward) feedback from this environment.</p>
      <sec id="sec-3-1">
        <title>3.1. MDP Formulation Details</title>
        <p>Our MDP is defined by the tuple ℳ = ⟨𝒮, 𝒜, ℛ, 𝒫, γ⟩, each of which we summarize below.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. State Space (𝒮)</title>
          <p>
            A critical design choice is the state representation s ∈ 𝒮. The primary component is AFL++’s global
coverage bitmap (specifically, the virgin_bits array). This bitmap indicates which program edges
have been covered and, for each edge, which hit-count bins (e.g., hit 1x, 2x, 3x, 4-7x, etc. [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]) have
been achieved. Each edge’s coverage status is typically represented by a byte, providing a fine-grained
view of exploration. Figure 2 visualizes this structure. Details of the hit-count bins are provided in
Appendix C. This global coverage map, of size N (the number of instrumented edges in the PUT),
forms a fixed-size vector in [0x00, 0xFF]^N and serves as the core observation for the agent.
          </p>
          <p>To allow the agent to potentially adapt its strategy over the duration of a fuzzing campaign, we
make the MDP time-aware [30] by including the remaining allocated fuzzing time (in seconds) as an
additional scalar component in the state s. This aims to provide context about the urgency or remaining
opportunity for exploration. The combined state (coverage map + remaining time) is designed to be
Markovian, providing sufficient information for decision-making.</p>
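          <p>As an illustration of this state encoding, the sketch below shows how such an observation (coverage bytes plus remaining time) could be declared and assembled in a Gymnasium-style environment. The map size, the time budget, and the build_observation helper are assumptions made for the example, not code from our implementation.</p>
          <preformat>
import numpy as np
import gymnasium as gym

N_EDGES = 65536          # assumed coverage-map size (AFL++'s default map is 64 KiB)
TIME_BUDGET = 8 * 3600   # assumed campaign length in seconds

# Observation: one byte per edge (hit-count bins) plus the remaining time as a scalar.
observation_space = gym.spaces.Dict({
    "coverage": gym.spaces.Box(low=0, high=255, shape=(N_EDGES,), dtype=np.uint8),
    "time_left": gym.spaces.Box(low=0.0, high=TIME_BUDGET, shape=(1,), dtype=np.float32),
})

def build_observation(virgin_bits: bytes, seconds_left: float) -> dict:
    """Assemble the MDP state s from a snapshot of AFL++'s virgin_bits and the clock."""
    return {
        "coverage": np.frombuffer(virgin_bits, dtype=np.uint8, count=N_EDGES).copy(),
        "time_left": np.array([seconds_left], dtype=np.float32),
    }
          </preformat>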
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Action Space (𝒜)</title>
          <p>The agent’s action a_t at each time step t is a tuple (e, p), controlling two aspects of seed scheduling:</p>
          <p>• Edge Selection (e): The agent selects an edge e from the set of currently covered edges in the
program’s Control-Flow Graph (CFG), as indicated by the state’s coverage map. This selected
edge e is then used as an index into AFL++’s top_rated list to retrieve the actual seed input to
be fuzzed. This indirect selection mechanism ensures that the action component for seed choice
is bounded by N, the size of the coverage map, rather than the potentially unbounded and
dynamic size of the seed queue. Invalid action masking [31] is applied to the policy network’s
output to ensure that only currently covered edges can be selected.</p>
          <p>• Mutation Energy (p): The agent also selects a mutation energy p to be assigned to the chosen
seed. This energy value, determining the number of mutations to generate, is chosen from a
predefined discrete set, typically ranging from 10% to 300% of AFL++’s default energy allocation
for a seed.</p>
          <p>The full action space for a given state s is the Cartesian product of the available covered edges and the
set of discrete energy levels:</p>
          <p>𝒜(s) = 𝒜_edge(s) × 𝒜_power(s), for all s ∈ 𝒮. (1)</p>
          <p>For some experiments, the agent was configured to only perform edge selection (e), while relying on
AFL++’s default power scheduling mechanisms for p.</p>
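          <p>The following simplified sketch illustrates the masked, indirect edge selection together with a discrete energy choice. The energy levels and the sampling helper are illustrative assumptions; in our implementation the mask is applied to the policy network’s output inside the PPO agent.</p>
          <preformat>
import numpy as np

ENERGY_LEVELS = [0.1, 0.5, 1.0, 2.0, 3.0]  # assumed fractions of AFL++'s default energy

def action_mask(coverage: np.ndarray) -> np.ndarray:
    """An edge is a valid action only if it has been covered at least once."""
    return coverage > 0

def sample_masked_action(edge_logits: np.ndarray, coverage: np.ndarray,
                         rng: np.random.Generator) -> tuple:
    """Pick (edge index e, energy index p); invalid edges receive probability zero.
    Assumes at least one edge has been covered."""
    logits = np.where(action_mask(coverage), edge_logits, -np.inf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    e = rng.choice(len(probs), p=probs)
    p = rng.integers(len(ENERGY_LEVELS))   # placeholder: uniform energy choice
    return int(e), int(p)
          </preformat>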
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3.1.3. Reward Function (ℛ)</title>
      <p>The reward function ℛ is designed to encourage the agent to select seeds and energy levels that
efficiently lead to the discovery of new program behaviors. After the fuzzer completes a round of
mutations on the seed chosen via action a = (e, p), the reward is calculated as the ratio of generated
test-cases that were marked "interesting" (i.e., triggered new edge coverage or hit a new hit-count bin)
to the total number of test-cases generated in that round:</p>
      <p>r = |{x ∈ X′ | x is interesting}| / |X′|, (2)</p>
      <p>where X′ is the set of all test-cases generated from the seed associated with a. A reward closer to 1
indicates a highly efficient scheduling decision for that step.</p>
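      <p>A minimal sketch of the per-step reward of Eq. (2) follows, assuming the fuzzer reports how many of the test-cases generated in the round were flagged as interesting (the function name and counters are illustrative):</p>
      <preformat>
def step_reward(num_interesting: int, num_generated: int) -> float:
    """Reward r = |interesting test-cases| / |all test-cases generated this round|."""
    if num_generated == 0:
        return 0.0  # no mutations were executed; assign a neutral reward
    return num_interesting / num_generated

# Example: 12 of 400 mutated inputs triggered new coverage or a new hit-count bin.
assert round(step_reward(12, 400), 6) == 0.03
      </preformat>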
    </sec>
    <sec id="sec-5">
      <title>3.1.4. Transition Probabilities () and Discount Factor ()</title>
      <p>The state transition probability function 𝒫(s_{t+1} | s_t, a_t) is implicitly defined by the complex interactions
within the CGF (mutation, execution, coverage update) and the PUT. As is common in DRL applied to
complex systems, 𝒫 is not explicitly modeled; the agent learns a policy in a model-free manner. We use
a discount factor γ = 0.99 to value future rewards.</p>
      <sec id="sec-5-1">
        <title>3.1.5. Episode Definition and MDP Dynamics</title>
        <p>Theoretically, our MDP has a finite horizon, as an episode would terminate upon achieving complete
coverage of the PUT, which is a finite set of edges. However, in practical fuzzing scenarios, complete
coverage is rarely attained, leading to what is effectively a single, very long episode for the entire
fuzzing campaign. This presents a significant challenge for DRL algorithms, which typically learn from
experience aggregated over many shorter episodes.</p>
        <p>Furthermore, a key characteristic of our state representation (the global coverage map) is that it
is largely non-decreasing. Newly discovered coverage updates the bitmap, but existing coverage is
persistent. Consequently, the agent generally cannot revisit exact past states that represented a lesser
degree of coverage. This results in a predominantly unidirectional exploration of the state space, as
illustrated in Figure 3.</p>
        <p>To facilitate DRL training by providing more frequent learning updates and varied experiences
in the face of a potentially single, long fuzzing episode, we implemented a multi-queue strategy.
In this setup, each initial seed (or a small group of initial seeds) forms the basis of an independent
"fuzzing group." Each group maintains its own coverage map context relative to its initial seed(s), and
the DRL agent cycles through these groups (e.g., via round-robin scheduling). Interactions within each
group are then treated as part of a shorter, distinct episode for learning updates. New inputs found
to be interesting within a group’s context are added to that group’s queue. Furthermore, with a small
probability, a particularly fruitful input might be used to initialize an entirely new group. This approach
aims to provide the necessary episodic structure crucial for many DRL algorithms. A schematic of this
multi-queue mechanism is provided in Appendix E (Figure 8).</p>
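        <p>The following simplified sketch illustrates this multi-queue episode generation. The data structures, the round-robin loop, and the spawn probability are illustrative assumptions rather than our exact implementation, and the random seed choice stands in for the agent’s learned policy.</p>
        <preformat>
import random
from dataclasses import dataclass

SPAWN_PROB = 0.05  # assumed small probability of promoting a fruitful input to a new group

@dataclass
class FuzzGroup:
    queue: list            # seeds belonging to this group (assumed non-empty)
    coverage: bytearray    # per-group coverage context

def run_episode(group: FuzzGroup, horizon: int, fuzz_one):
    """Treat `horizon` scheduling steps within one group as a single learning episode."""
    transitions = []
    for _ in range(horizon):
        seed = random.choice(group.queue)                 # placeholder for the agent's choice
        new_inputs, reward = fuzz_one(seed, group.coverage)
        group.queue.extend(new_inputs)                    # interesting inputs join this group
        transitions.append((seed, reward))
    return transitions

def schedule(groups: list, horizon: int, fuzz_one):
    """Round-robin over groups; occasionally spawn a new group from a fruitful input."""
    for group in list(groups):
        for seed, reward in run_episode(group, horizon, fuzz_one):
            if reward > 0 and random.random() < SPAWN_PROB:
                groups.append(FuzzGroup(queue=[seed], coverage=bytearray(len(group.coverage))))
        </preformat>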
        <sec id="sec-5-1-1">
          <title>3.2. DRL Agent Design and System Integration</title>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>3.2.1. Algorithm Choice: PPO</title>
        <p>We selected Proximal Policy Optimization (PPO) [29] as our DRL algorithm. PPO is a policy gradient
method well-regarded for its balance of sample efficiency, ease of implementation, and stable
performance across a range of tasks. It works by optimizing a clipped surrogate objective function designed
to prevent excessively large policy updates, thereby promoting more stable learning. PPO typically
incorporates Generalized Advantage Estimation (GAE) [32] for lower variance advantage estimates, a
value function loss to learn a state-value baseline, and an entropy bonus to encourage exploration. (The
detailed PPO objective functions are provided in Appendix A).</p>
      </sec>
      <sec id="sec-5-3">
        <title>3.2.2. Network Architecture and Training</title>
        <p>The policy (actor) network, which outputs probabilities for edge selection and a distribution for energy
choice, and the value (critic) network, which estimates the state value, are both implemented as
MultiLayer Perceptrons (MLPs). Each network consists of two hidden layers with 64 units per layer, utilizing
the hyperbolic tangent (tanh) as the activation function. The Adam optimizer [33] is employed for
updating the network parameters. Specific hyperparameters used for training, such as learning rates,
batch sizes, and PPO-specific coefficients, are detailed in Appendix B.</p>
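        <p>For reference, a minimal sketch of how such an actor-critic pair can be configured with Stable-Baselines3 is shown below. The stand-in environment and batch size are assumptions for illustration; the rollout horizon and learning rate mirror the values reported in Section 4.2.1, and the full set of hyperparameters actually used is listed in Appendix B.</p>
        <preformat>
import gymnasium as gym
import torch
from stable_baselines3 import PPO

# Stand-in environment for illustration only; in our setup this is the custom
# AFL++ scheduling environment described in Section 3.2.3.
env = gym.make("CartPole-v1")

model = PPO(
    "MlpPolicy",                  # actor and critic: two hidden layers of 64 tanh units
    env,
    policy_kwargs=dict(net_arch=[64, 64], activation_fn=torch.nn.Tanh),
    learning_rate=1e-3,           # cf. Section 4.2.1; see Appendix B for all values used
    n_steps=32,                   # rollout horizon per update (cf. Section 4.2.1)
    batch_size=32,                # assumed, chosen to match the rollout length
    gamma=0.99,
    verbose=1,
)
model.learn(total_timesteps=2_048)
        </preformat>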
      </sec>
      <sec id="sec-5-4">
        <title>3.2.3. Integration with AFL++ via a Custom Gym Environment</title>
        <p>To enable interaction between our Python-based DRL agent—built using the Proximal Policy
Optimization (PPO) implementation from the Stable-Baselines3 library [34]—and the AFL++ fuzzer (written in C),
we developed a custom environment compatible with the Gymnasium API [35]. This "Gym wrapper"
acts as an intermediary, managing Inter-Process Communication (IPC) through message queues. At
each step:
1. AFL++ pauses its main loop and sends the current state (coverage bitmap and remaining time) to
the Gym environment via IPC.
2. The Gym environment passes this state to the DRL agent.
3. The agent selects an action (chosen edge and energy level) based on its current policy.
4. This action is sent back to AFL++ via IPC.
5. AFL++ executes the fuzzing round with the agent-specified seed and energy, calculates the reward
based on the outcome, and determines the next state.</p>
        <p>6. The new state and reward are sent back to the agent for learning.</p>
        <p>This architecture, illustrated in Figure 4, allows the DRL agent to control AFL++’s seed scheduling. A
detailed description can be found in Figure 7 in the Appendix.</p>
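        <p>A skeleton of such a Gymnasium wrapper is sketched below. The IPC helper object and its methods (receive_state, send_action, receive_result) are hypothetical stand-ins for the message-queue communication with our modified AFL++, and the observation and action sizes are assumptions.</p>
        <preformat>
import numpy as np
import gymnasium as gym

N_EDGES = 65536     # assumed coverage-map size
N_ENERGY = 5        # assumed number of discrete energy levels

class AFLSchedulerEnv(gym.Env):
    """Gymnasium wrapper mediating between the PPO agent and AFL++ over IPC (sketch)."""

    def __init__(self, ipc):
        self.ipc = ipc   # hypothetical object wrapping the message queues to modified AFL++
        self.observation_space = gym.spaces.Box(0, 255, shape=(N_EDGES + 1,), dtype=np.float32)
        self.action_space = gym.spaces.MultiDiscrete([N_EDGES, N_ENERGY])

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        state = self.ipc.receive_state()          # steps 1-2: coverage bitmap + remaining time
        return state, {}

    def step(self, action):
        edge, energy = int(action[0]), int(action[1])
        self.ipc.send_action(edge, energy)                 # step 4: action back to AFL++
        state, reward, done = self.ipc.receive_result()    # steps 5-6: next state and reward
        return state, reward, done, False, {}
        </preformat>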
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4. Experimental Evaluation</title>
      <p>This section details the experimental setup used to evaluate our DRL-based seed scheduling approach
and presents the results. The primary goal was to assess whether the PPO agent, guided by our MDP
formulation, could learn an effective seed scheduling policy and how its performance compared to
baseline AFL++ heuristics.</p>
      <sec id="sec-6-1">
        <title>4.1. Experimental Setup</title>
        <p>
          Target Program and Fuzzer. All experiments were conducted by fuzzing the pdftotext utility
(version 3.02 from the xpdf suite), a real-world program that parses PDF files and extracts text. We
used AFL++ version 4.10c [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] as the base fuzzer, modified to integrate our DRL agent as described in
Section 3.2.
        </p>
        <p>DRL Agent Configuration. We used PPO as the seed scheduling agent. The neural network
architecture and key PPO hyperparameters are detailed in Section 3.2 and Appendix B, respectively. Training
utilized the multi-queue strategy (described in Section 3.1.5 and visualized Figure 8 in Appendix E) to
generate multiple learning episodes from a single fuzzing campaign.</p>
        <sec id="sec-6-1-1">
          <title>Experimental Variants.</title>
          <p>We evaluated the following configurations:
• AFL++ Default: Standard AFL++ using its default explore power schedule.
• DRL-Full: Our PPO agent controlling both seed selection (edge choice) and mutation energy
assignment.
• DRL-Explore: Our PPO agent controlling only seed selection (edge choice), while mutation
energy was determined by AFL++’s explore power schedule.</p>
          <p>Metrics and Procedure. Each fuzzing campaign was run for eight hours. Performance was primarily
evaluated based on edge coverage achieved over time. We also monitored executions per second, the
number of test-cases saved to the queue, and the maximum depth (number of ancestors) of test-cases.</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>4.2. Results and Analysis</title>
        <sec id="sec-6-2-1">
          <title>4.2.1. DRL Agent Training Performance</title>
          <p>The training dynamics of the PPO agent were monitored throughout the eight hour fuzzing campaigns.
Figure 5 presents typical training metrics (value loss, policy loss, and explained variance) for four
experimental runs (two for DRL-Full, two for DRL-Explore) using the multi-queue setup with a horizon
of 32 steps and a learning rate of 0.001.</p>
          <p>As shown in Figure 5, while the state value loss (  ) sometimes exhibited temporary decreases, the
policy’s surrogate loss ( ) did not show clear convergence towards improvement. The explained
variance remained near-zero throughout the fuzzing campaigns for all DRL agent configurations. This
indicates that the value function was not accurately predicting returns, and consequently, the agent
struggled to learn a consistently better policy than random exploration or simple heuristics within the
allocated time and data.</p>
        </sec>
        <sec id="sec-6-2-2">
          <title>4.2.2. Comparative Fuzzing Performance</title>
          <p>Figure 6 compares key performance metrics—edge coverage over time, executions per second, corpus,
crash, and depth information—for the DRL variants against the AFL++ default over an eight-hour period
on pdftotext.</p>
          <p>As shown in the first row of Figure 6, the default AFL++ configuration achieved approximately 4%
more edge coverage after eight hours compared to both DRL variants. The executions per second
(second row of Figure 6) were, as expected, higher for the default AFL++ due to the overhead introduced
by the DRL agent interaction and IPC.</p>
          <p>The DRL variants saved significantly more test-cases (approx. 4x for DRL-Full, 2x for DRL-Explore)
compared to default AFL++. This is primarily an artifact of the multi-queue training strategy: since each
"group" in the multi-queue maintains its own coverage context, inputs are deemed "new" or "interesting"
relative to that smaller context, leading to more frequent additions to the respective group’s queue.
Similarly, the maximum depth of test-case ancestry was higher for the DRL variants. However, these
effects on corpus statistics did not translate into superior overall coverage or bug finding performance
within the eight-hour timeframe.</p>
          <p>Sub-figure captions: (a) full DRL seed scheduling (seed selection and power scheduling); (b) DRL seed selection with AFL++’s explore power schedule; (c) AFL++ with default settings (explore power schedule).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>5. Discussion and Conclusion</title>
      <p>This paper explored the formalization of coverage-guided fuzzer seed scheduling as a Markov Decision
Process (MDP) and the application of Deep Reinforcement Learning (DRL) to learn an optimal scheduling
policy. Our primary contribution lies in the proposed MDP formulation, utilizing the global coverage
map as state, and the subsequent analysis of applying DRL to this complex, dynamic environment.</p>
      <p>Modeling seed scheduling as an MDP presents inherent challenges, including defining a Markovian
state representation that captures sufficient history for reward calculation, and designing an action
space that is both expressive and bounded for neural network agents. Our approach addressed these
by using the coverage map and remaining time as state, and by mapping edge selections to seeds via
AFL++’s top_rated list. However, training the DRL (PPO) agent proved difficult. Key reasons include:
(1) the inability to pre-train policies due to the program-specific nature of the coverage map state; (2)
the effectively single, long episode of a typical fuzzing campaign, requiring the agent to generalize
rapidly; (3) the slow execution of actions (a full fuzzing round per seed), limiting data collection for
learning; and (4) the extensive computational resources required for effective hyperparameter tuning in
such a noisy, high-variance environment. Consequently, our implemented agent did not demonstrate
significant policy improvement over baseline heuristics.</p>
      <p>
        These challenges highlight a contrast with DRL applications for fuzzing mutation strategies, where
the action space is often smaller and knowledge might be more transferable. Simpler Reinforcement
Learning models, such as Multi-Armed Bandits (MABs) [
        <xref ref-type="bibr" rid="ref15">17, 23, 21, 20, 15</xref>
        ], offer less modeling complexity
for seed scheduling by focusing on immediate rewards. While they forego explicit long-term planning,
the difficulty in defining predictable long-term consequences in the chaotic fuzzing environment makes
it debatable whether the complexity of a full MDP for scheduling is always warranted without further
advances in state representation or learning algorithms.
      </p>
      <p>Conclusion and Future Work. Our exploration suggests that while formalizing seed scheduling as
an MDP is feasible, effectively training current DRL agents on this formulation within practical CGF
setups faces substantial hurdles related to sample efficiency, generalization, and training stability. The
negative results underscore the difficulty of applying DRL to this specific fuzzing sub-problem with
current methods. Future work should focus on several avenues. Investigating more sample-efficient DRL
algorithms or model-based RL could be beneficial. For MAB-based approaches, systematic comparisons
of different bandit algorithms (including non-stationary and contextual variants [36]) and reward
definitions are needed. Exploring hybrid approaches that combine simpler learning models with more
expressive state features also holds promise. Ultimately, developing truly adaptive and intelligent seed
schedulers may require novel learning frameworks tailored to the unique characteristics of the fuzzing
process.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>We used generative AI tools (e.g., Gemini 2.5 Pro) to assist with language refinement and formatting
tasks such as table arrangement and phrasing improvements. All technical content and results were
developed by the authors.</p>
    </sec>
    <sec id="sec-8b">
      <title>References [16]–[36]</title>
      <p>[16] Q. Zhao, Multi-armed bandits: Theory and applications to online learning in networks, Springer Nature, 2022.</p>
      <p>[17] S. Luo, A. Herrera, P. Quirk, M. Chase, D. C. Ranasinghe, S. S. Kanhere, Make out like a (multi-armed) bandit: Improving the odds of fuzzer seed scheduling with T-Scheduler, in: Proceedings of the 19th ACM Asia Conference on Computer and Communications Security, 2024, pp. 1463–1479.</p>
      <p>[18] W. R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika 25 (1933). doi:10.2307/2332286.</p>
      <p>[19] P. Auer, N. Cesa-Bianchi, P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning 47 (2002) 235–256.</p>
      <p>[20] T. Yue, P. Wang, Y. Tang, E. Wang, B. Yu, K. Lu, X. Zhou, EcoFuzz: Adaptive energy-saving greybox fuzzing as a variant of the adversarial multi-armed bandit, in: Proceedings of the 29th USENIX Security Symposium, 2020.</p>
      <p>[21] Y. Su, D. Xiong, Y. Wan, C. Shi, Q. Zeng, LinFuzz: Program-sensitive seed scheduling greybox fuzzing based on LinUCB algorithm, IEEE Access (2024).</p>
      <p>[22] A. Beygelzimer, J. Langford, L. Li, L. Reyzin, R. Schapire, Contextual bandit algorithms with supervised learning guarantees, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 2011, pp. 19–26.</p>
      <p>[23] C. Lyu, H. Liang, S. Ji, X. Zhang, B. Zhao, M. Han, Y. Li, Z. Wang, W. Wang, R. Beyah, SLIME: Program-sensitive energy allocation for fuzzing, in: ISSTA 2022 - Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2022. doi:10.1145/3533767.3534385.</p>
      <p>[24] J.-Y. Audibert, R. Munos, C. Szepesvári, Exploration–exploitation tradeoff using variance estimates in multi-armed bandits, Theoretical Computer Science 410 (2009) 1876–1902.</p>
      <p>[25] K. Böttinger, P. Godefroid, R. Singh, Deep reinforcement fuzzing, in: Proceedings - 2018 IEEE Symposium on Security and Privacy Workshops, SPW 2018, 2018. doi:10.1109/SPW.2018.00026.</p>
      <p>[26] Z. Zhang, B. Cui, C. Chen, Reinforcement learning-based fuzzing technology, in: Advances in Intelligent Systems and Computing, volume 1195 AISC, 2021. doi:10.1007/978-3-030-50399-4_24.</p>
      <p>[27] J. Shao, Y. Zhou, G. Liu, D. Zheng, Optimized mutation of grey-box fuzzing: A deep RL-based approach, in: Proceedings of 2023 IEEE 12th Data Driven Control and Learning Systems Conference, DDCLS 2023, 2023. doi:10.1109/DDCLS58216.2023.10166955.</p>
      <p>[28] G. Choi, S. Jeon, J. Cho, J. Moon, A seed scheduling method with a reinforcement learning for a coverage guided fuzzing, IEEE Access 11 (2023). doi:10.1109/ACCESS.2022.3233875.</p>
      <p>[29] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347 (2017).</p>
      <p>[30] F. Pardo, A. Tavakoli, V. Levdik, P. Kormushev, Time limits in reinforcement learning, in: 35th International Conference on Machine Learning, ICML 2018, volume 9, 2018.</p>
      <p>[31] S. Huang, S. Ontañón, A closer look at invalid action masking in policy gradient algorithms, in: Proceedings of the International Florida Artificial Intelligence Research Society Conference, FLAIRS, volume 35, 2022. doi:10.32473/flairs.v35i.130584.</p>
      <p>[32] J. Schulman, P. Moritz, S. Levine, M. Jordan, P. Abbeel, High-dimensional continuous control using generalized advantage estimation, arXiv preprint arXiv:1506.02438 (2015).</p>
      <p>[33] D. P. Kingma, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).</p>
      <p>[34] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, N. Dormann, Stable-Baselines3: Reliable reinforcement learning implementations, Journal of Machine Learning Research 22 (2021) 1–8. URL: http://jmlr.org/papers/v22/20-1364.html.</p>
      <p>[35] M. Towers, J. K. Terry, A. Kwiatkowski, J. U. Balis, G. d. Cola, T. Deleu, M. Goulão, A. Kallinteris, A. KG, M. Krimmel, R. Perez-Vicente, A. Pierré, S. Schulhof, J. J. Tai, A. T. J. Shen, O. G. Younis, Gymnasium, 2023. URL: https://zenodo.org/record/8127025. doi:10.5281/zenodo.8127026.</p>
      <p>[36] D. Wang, Z. Zhang, H. Zhang, Z. Qian, S. V. Krishnamurthy, N. Abu-Ghazaleh, SyzVegas: Beating kernel fuzzing odds with reinforcement learning, in: Proceedings of the 30th USENIX Security Symposium, 2021.</p>
    </sec>
    <sec id="sec-9">
      <title>A. PPO Objective Functions</title>
      <p>The Proximal Policy Optimization (PPO) algorithm [29] aims to maximize a clipped surrogate objective.
The Conservative Policy Iteration (CPI) loss is:</p>
      <p>L^CPI(θ) = Ê_t[ (π_θ(a_t | s_t) / π_θold(a_t | s_t)) Â_t ] = Ê_t[ r_t(θ) Â_t ],</p>
      <p>where r_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t) denotes the probability ratio and Â_t the estimated advantage. PPO uses a clipped version:</p>
      <p>L^CLIP(θ) = Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ].</p>
      <p>The full loss includes a value function term L^VF_t(θ) = (V_θ(s_t) − V^targ_t)² and an entropy bonus S[π_θ](s_t):</p>
      <p>L^{CLIP+VF+S}_t(θ) = Ê_t[ L^CLIP_t(θ) − c_1 L^VF_t(θ) + c_2 S[π_θ](s_t) ].</p>
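      <p>For concreteness, the clipped surrogate term can be computed as in the short, self-contained PyTorch fragment below; the tensors hold illustrative values only.</p>
      <preformat>
import torch

eps = 0.2                                         # PPO clipping range ε
advantages = torch.tensor([0.5, -1.0, 2.0])       # Â_t (illustrative values)
old_log_probs = torch.tensor([-1.2, -0.7, -2.0])  # log π_θold(a_t|s_t)
new_log_probs = torch.tensor([-1.0, -0.9, -1.5])  # log π_θ(a_t|s_t)

ratio = torch.exp(new_log_probs - old_log_probs)              # r_t(θ)
unclipped = ratio * advantages
clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
surrogate = torch.min(unclipped, clipped).mean()              # L^CLIP(θ)
loss = -surrogate                                             # negated for gradient descent
print(float(loss))
      </preformat>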
    </sec>
    <sec id="sec-10">
      <title>B. PPO Hyperparameters</title>
    </sec>
    <sec id="sec-11">
      <title>C. AFL Hit-Count Bins</title>
      <p>
        Hit-count bins used by AFL [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Each byte of the coverage map records, per edge, which of the following eight hit-count bins have been reached:
      </p>
      <preformat>
Bit:     1    2    3    4      5      6       7        8
Counts:  1    2    3    4-7    8-15   16-31   32-127   128+
      </preformat>
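      <p>A small illustrative helper mapping a raw hit count to its bin index follows; it reproduces the bucketing of the table above and is not AFL’s actual lookup-table implementation.</p>
      <preformat>
def hitcount_bucket(count: int) -> int:
    """Map a raw edge hit count to its AFL hit-count bin (1-8); 0 means the edge was not hit."""
    if count == 0:
        return 0
    bounds = [1, 2, 3, 7, 15, 31, 127]   # upper bound of bins 1..7; bin 8 is 128+
    for bin_index, upper in enumerate(bounds, start=1):
        if count > upper:
            continue
        return bin_index
    return 8

assert [hitcount_bucket(c) for c in (0, 1, 2, 3, 5, 10, 20, 64, 200)] == [0, 1, 2, 3, 4, 5, 6, 7, 8]
      </preformat>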
    </sec>
    <sec id="sec-12">
      <title>D. Detailed DRL Process Overview</title>
      <p>Figure 7 provides a detailed flowchart of the DRL seed scheduling process as integrated into AFL++,
including the multi-queue mechanism and agent interaction points.</p>
    </sec>
    <sec id="sec-13">
      <title>E. Multi-Queue Strategy for Episode Generation</title>
      <p>To address the challenge of DRL training with typically long fuzzing campaigns, we employed a
multi-queue strategy. This approach, illustrated in Figure 8, divides the fuzzing process into shorter,
pseudo-independent episodes by managing separate "fuzzing groups," each with its own coverage
context.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zalewski</surname>
          </string-name>
          ,
          <article-title>American fuzzy lop (afl) fuzzer</article-title>
          , URL: https://lcamtuf.coredump.cx/afl (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fioraldi</surname>
          </string-name>
          ,
          <article-title>Fuzzing in the 2020s: novel approaches and solutions</article-title>
          ,
          <source>Ph.D. thesis</source>
          , Sorbonne Université,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abo-eleneen</surname>
          </string-name>
          , A. Palliyali,
          <string-name>
            <given-names>C.</given-names>
            <surname>Catal</surname>
          </string-name>
          ,
          <article-title>The role of reinforcement learning in software testing</article-title>
          ,
          <year>2023</year>
          . doi:
          <volume>10</volume>
          .1016/j.infsof.
          <year>2023</year>
          .
          <volume>107325</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <article-title>Deep learning for coverage-guided fuzzing: How far are we?</article-title>
          ,
          <source>IEEE Transactions on Dependable and Secure Computing</source>
          (
          <year>2022</year>
          ). doi:
          <volume>10</volume>
          .1109/TDSC.
          <year>2022</year>
          .
          <volume>3200525</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>She</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Epstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jana</surname>
          </string-name>
          , Neuzz:
          <article-title>Efficient fuzzing with neural program smoothing</article-title>
          ,
          <source>in: Proceedings - IEEE Symposium on Security and Privacy</source>
          , volume 2019-May,
          <year>2019</year>
          . doi:
          <volume>10</volume>
          .1109/SP.
          <year>2019</year>
          .
          <volume>00052</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jia</surname>
          </string-name>
          , L. Liu,
          <string-name>
            <given-names>C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>A systematic review of fuzzing based on machine learning techniques</article-title>
          ,
          <year>2020</year>
          . doi:
          <volume>10</volume>
          .1371/journal.pone.
          <volume>0237749</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Puterman</surname>
          </string-name>
          ,
          <article-title>Markov decision processes: Discrete stochastic dynamic programming</article-title>
          ,
          <source>WileyInterscience</source>
          ,
          <year>2008</year>
          . doi:
          <volume>10</volume>
          .1002/9780470316887.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning: An introduction</article-title>
          ,
          <source>IEEE Transactions on Neural Networks</source>
          <volume>9</volume>
          (
          <year>2005</year>
          ). doi:
          <volume>10</volume>
          .1109/tnn.
          <year>1998</year>
          .
          <volume>712192</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V. J.</given-names>
            <surname>Manes</surname>
          </string-name>
          , H. Han, C. Han,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Cha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Egele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Woo</surname>
          </string-name>
          ,
          <article-title>The art, science, and engineering of fuzzing: A survey</article-title>
          ,
          <source>IEEE Transactions on Software Engineering</source>
          <volume>47</volume>
          (
          <year>2021</year>
          ). doi:
          <volume>10</volume>
          .1109/TSE.
          <year>2019</year>
          .
          <volume>2946563</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fioraldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Eißfeldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heuse</surname>
          </string-name>
          , AFL++
          <article-title>: Combining incremental steps of fuzzing research</article-title>
          ,
          <source>in: 14th USENIX Workshop on Offensive Technologies (WOOT 20)</source>
          , USENIX Association,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>She</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jana</surname>
          </string-name>
          ,
          <article-title>Effective seed scheduling for fuzzing with graph centrality analysis</article-title>
          ,
          <source>in: Proceedings - IEEE Symposium on Security and Privacy</source>
          , volume 2022-May,
          <year>2022</year>
          . doi:
          <volume>10</volume>
          .1109/ SP46214.
          <year>2022</year>
          .
          <volume>9833761</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Böhme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. T.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roychoudhury</surname>
          </string-name>
          ,
          <article-title>Coverage-based greybox fuzzing as markov chain</article-title>
          ,
          <source>IEEE Transactions on Software Engineering</source>
          <volume>45</volume>
          (
          <year>2019</year>
          ). doi:
          <volume>10</volume>
          .1109/TSE.
          <year>2017</year>
          .
          <volume>2785841</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Böhme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. J.</given-names>
            <surname>Manès</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Cha</surname>
          </string-name>
          ,
          <article-title>Boosting fuzzer efficiency: An information theoretic perspective</article-title>
          ,
          <source>in: ESEC/FSE 2020 - Proceedings of the 28th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering</source>
          ,
          <year>2020</year>
          . doi:
          <volume>10</volume>
          .1145/ 3368089.3409748.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ahmadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Farkhani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lu</surname>
          </string-name>
          , Meuzz:
          <article-title>Smart seed scheduling for hybrid fuzzing</article-title>
          ,
          <source>in: RAID 2020 Proceedings - 23rd International Symposium on Research in Attacks, Intrusions and Defenses</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning-based hierarchical seed scheduling for greybox fuzzing, in: 28th Annual Network and Distributed System Security Symposium</article-title>
          ,
          <string-name>
            <surname>NDSS</surname>
          </string-name>
          <year>2021</year>
          ,
          <year>2021</year>
          . doi:
          <volume>10</volume>
          .14722/ndss.
          <year>2021</year>
          .
          <volume>24486</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>