<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>I. Boryndo);</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multicriteria structural-parametric synthesis of optimal hybrid CNN structure for image processing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Illia Boryndo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Victor Sineglazov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Aviation University</institution>
          ,
          <addr-line>Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Convolutional Neural Networks (CNNs) have become the standard for image classification tasks, yet their design remains a complex challenge due to the vast search space of possible architectures and the need to balance multiple conflicting objectives. This research introduces a multicriteria structural-parametric synthesis approach for the automated design of optimal hybrid CNN architectures, demonstrated on the task of gesture recognition. The proposed method uses an evolutionary algorithm that simultaneously optimizes the structure (layer types, connections, blocks) and hyperparameters (e.g., kernel size, activation functions) of CNNs against a multi-objective fitness function, which is formulated in this paper. The approach employs genetic operators such as modified crossover, mutation, and selection, and leverages incremental training and weight inheritance to accelerate search convergence. The synthesized hybrid CNN incorporates advanced modules such as squeeze-and-excitation blocks, spatial-channel squeeze convolutions, and attention mechanisms, which improve the quality criteria of the model. The method is compared with existing approaches, including reinforcement learning-based NAS, NSGA-Net, and differentiable NAS (DARTS). Experimental results on a gesture recognition dataset show that the proposed method outperforms manually designed networks and other automated architecture search techniques, achieving 98.7% accuracy while maintaining low computational cost. The experimental results indicate that using complex structural blocks instead of traditional layers, together with a flexibly configured fitness function covering both quality and performance criteria, significantly improves the resulting model.</p>
      </abstract>
      <kwd-group>
        <kwd>structural-parametric synthesis</kwd>
        <kwd>convolutional neural networks</kwd>
        <kwd>genetic algorithm</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Convolutional Neural Networks (CNNs) have revolutionized image classification, achieving
state-of-the-art accuracy on tasks from general object recognition to specialized domains like hand
gesture recognition [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, designing an optimal CNN architecture for a given task is challenging due to the
enormous search space of possible layer configurations and hyperparameters. Traditionally, human
experts crafted CNNs (e.g. ResNet, VGG) through trial and error, but this manual process may not
yield the best trade-off between accuracy and efficiency for every application. Recent advances in
Neural Architecture Search (NAS) aim to automate CNN design, exploring architectures via
reinforcement learning or evolutionary algorithms. In real-world applications like gesture
recognition, there is a pressing need for CNNs that are not only accurate but also efficient in
computation and memory, to enable real-time performance on limited hardware. This research
addresses these challenges by proposing a multicriteria structural-parametric synthesis approach –
a genetic algorithm-based method that optimizes CNN architectures (structure) and their
hyperparameters (parameters) simultaneously under multiple objectives. We focus on static hand
gesture classification as a representative case study to demonstrate the effectiveness of the
proposed hybrid CNN design and optimization algorithm.
      </p>
      <p>† These authors contributed equally.</p>
      <p>In this paper we analyze how the structural components of a hybrid convolutional neural
network (HCNN) and its configuration parameters influence the quality criteria of the model, and
how this information can be used during the structural-parametric synthesis of such models. The
main goal is to develop an evolutionary mechanism that combines structural components of
different CNN architectures into a model that satisfies predefined optimization criteria.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works and existing approaches</title>
      <p>
        CNN Architecture Optimization: The task of finding optimal CNN structures has been widely
studied in the last few years. Early NAS approaches employed reinforcement learning agents to
sequentially “build” neural network layers, as in Zoph and Le’s work that trained a recurrent
controller to maximize validation accuracy [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Follow-up methods like NASNet introduced a
modular search space (searching for an optimal convolutional cell that is repeated) and achieved
record accuracy on CIFAR-10 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Alternatively, evolutionary algorithms (EA) have been used to
evolve neural network architectures (a concept known as neuroevolution) by treating network
design as a combinatorial optimization problem [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Real et al. and others demonstrated that
genetic algorithms could evolve CNN topologies that rival human-designed models on image tasks,
through mutations (e.g. adding or removing layers) and crossover of high-performing networks.
Techniques such as Genetic CNN and NEAT variants allowed networks to grow in complexity over
generations, gradually improving accuracy on benchmarks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Recent evolutionary NAS methods
often incorporate modern tricks like network morphism (to reuse weights when altering structure)
and surrogate performance predictors to speed up the search.
      </p>
      <p>
        Hybrid and Advanced Architectures: Beyond pure NAS, researchers have explored hybrid CNN
architectures that combine different neural components or techniques. For example, in video-based
gesture recognition, CNNs have been combined with RNNs/LSTMs to capture spatial and temporal
features, yielding hybrid models that outperform single-stream CNNs [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Attention mechanisms
and squeeze-and-excitation (SE) blocks have been plugged into CNNs to adaptively recalibrate
features, markedly improving performance in image classification tasks. Such modules (e.g.
Inception blocks, residual connections, SE blocks) can be considered as building blocks in an
architecture search space. Recent work shows that incorporating these blocks in NAS can produce
hybrid CNNs that leverage multi-scale feature extraction, channel attention, and other advanced
features. However, searching in a space of heterogeneous components is complex. Some
approaches simplify this by evolving at the level of repeating units or cells (block-based NAS)
rather than individual layers.
      </p>
      <p>
        Structural-Parametric Synthesis Methods: Traditional NAS optimizes the architecture while
training network weights via gradient descent for evaluation. Structural-parametric synthesis
refers to jointly optimizing the network’s structure and its parameters (or hyperparameters). Early
neuroevolution often evolved both weights and topology, but for modern deep CNNs this is
impractical due to high dimensionality of weights. Recent approaches strike a hybrid strategy: the
algorithm evolves the structure (and certain hyperparameters like layer sizes or learning rates), but
uses standard backpropagation to train weights for each candidate model during evaluation. Some
works integrate hyperparameter optimization (HPO) into NAS, treating learning rate,
regularization, or data augmentation settings as part of the search genome. For instance, genetic
programming has been used to evolve CNN architectures with variable depth, where each
individual’s gene encodes layer types and connections as well as tunable parameters. Similarly,
reinforcement learning NAS frameworks such as MnasNet and MONAS introduced reward
functions that include latency or power consumption alongside accuracy. These multi-objective
methods yield a Pareto front of optimal trade-off architectures – e.g. a set of models that achieve
the highest accuracy for a given complexity [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>Our work builds on these ideas, extending multi-objective optimization to a broader set of
criteria and using an evolutionary algorithm to perform structural-parametric synthesis of a hybrid
CNN suited for gesture image classification. The goals of this paper are to define the main criteria
for CNN synthesis (accuracy, computational cost, model robustness, etc.), to analyze and extract
structural blocks of modern CNN architectures, and to adapt an existing evolutionary algorithm
and apply it to synthesize an optimal HCNN architecture.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Statement</title>
      <p>Hybrid Convolutional Neural Networks have demonstrated significant improvements in
performance across various complex tasks by integrating the strengths of Convolutional Neural
Networks with other neural network architectures. However, optimizing CNN structures for image
recognition involves several challenges that we aim to address:</p>
      <p>• Huge Search Space: The number of possible CNN architectures (varying in depth,
layer types, filter sizes, skip connections, etc.) is combinatorially large. Exhaustively
searching this space is infeasible; intelligent heuristics are needed to find good solutions
with limited trials.</p>
      <p>• Multi-Objective Trade-offs: We seek not just high accuracy, but also efficiency in
terms of computational cost, model size, and inference speed. These objectives often conflict
with each other (e.g. increasing depth can improve accuracy but worsens speed). The
problem requires balancing multiple criteria to find an optimal compromise, rather than
optimizing a single metric. In gesture recognition specifically, models must be small and
fast enough for real-time use (e.g. in an embedded system or AR/VR application) while
maintaining high accuracy.</p>
      <p>• Training and Evaluation Cost: Each candidate CNN architecture needs training (at
least partial) to evaluate its accuracy, which is time-consuming. Searching through many
candidates can thus be computationally expensive. The challenge is to reduce the cost per
evaluation (via weight inheritance, surrogate models, or partial training) and to converge to
good solutions in fewer generations/iterations.</p>
      <p>• Domain-Specific Requirements: For gesture recognition, the CNN may need to
handle variations in hand shape, orientation, lighting, and backgrounds. The optimized
architecture should be robust to these variations. Moreover, if the system is to be deployed
on devices (like VR headsets or mobile phones for HCI), constraints on memory and
compute are strict. The optimization problem must accommodate such domain constraints
as part of the objective (e.g. limiting model size for embedded deployment).</p>
      <p>In summary, the core problem is to automatically synthesize a CNN architecture that meets
multiple performance criteria (accuracy and various efficiency metrics) for image classification,
demonstrated on a hand gesture dataset. This involves formulating a search algorithm capable of
navigating the vast design space efficiently and evaluating candidates under realistic conditions (as
one would face in deploying a gesture recognition system).</p>
    </sec>
    <sec id="sec-4">
      <title>4. Analysis and assessment of modern CNN architectures and their functional blocks</title>
      <p>
        Convolutional Neural Networks (CNNs) have evolved through numerous architectural innovations,
with new structural blocks introduced to improve performance or efficiency. Modern CNN
architectures often incorporate specialized blocks – such as attention modules, depthwise separable
convolutions, inception modules, self-calibrated convolutions, dense connectivity, etc. – each
aiming to boost accuracy or efficiency. Evaluating the impact of these blocks on key metrics
(accuracy, computational cost, model size, training and inference speed) is crucial for
understanding trade-offs in design. In this work, we analyze several prominent structural blocks in
CNNs and assess their influence on model performance and efficiency. Using the CIFAR-100 image
classification dataset as a testbed (100 classes of 32×32 images), we compare how adding each block
to a baseline CNN affects accuracy, FLOPs (floating-point operations, a proxy for compute cost),
model parameters (size), training time per epoch, and inference latency. While experiments are on
CIFAR-100, the observed trends reflect general behaviors also reported on larger datasets like
ImageNet [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ].
      </p>
      <p>For this analysis we examined modern CNN architectures and extracted the following set of
structural blocks for testing:</p>
      <p>Attention Mechanisms in CNNs: Attention mechanisms direct a network’s focus to the most
relevant features, improving the representation of important content while suppressing less useful
information. In CNNs, attention can be applied in different forms: channel attention
(reweighting feature channels), spatial attention (highlighting important spatial regions), or
non-local/self-attention (capturing long-range dependencies). Prior studies have shown that
adding attention modules to CNNs yields consistent accuracy improvements across various
architectures.</p>
      <p>Depthwise Separable Convolutions: Depthwise separable convolution is an efficiency-driven
block that factorizes a standard convolution into two stages: a depthwise convolution (applying a
single filter per input channel) followed by a pointwise convolution (1×1 filters to mix channel
information). This factorization drastically reduces the number of parameters and multiply-add
operations required, compared to a conventional convolution with the same filter size and
channels.</p>
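      <p>The parameter saving described above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes stride 1 and "same" padding (so the output map is h × w), ignores biases, and uses hypothetical function names and an arbitrary 64 → 128 channel layer that is not taken from the experiments.</p>
      <preformat>
```python
def standard_conv_cost(k, c_in, c_out, h, w):
    """Return (parameters, multiply-adds) of a k x k convolution, bias ignored."""
    params = k * k * c_in * c_out
    macs = params * h * w  # one full kernel application per output position
    return params, macs


def depthwise_separable_cost(k, c_in, c_out, h, w):
    """Return (parameters, multiply-adds) of depthwise k x k plus pointwise 1 x 1."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 filters that mix the channels
    params = depthwise + pointwise
    return params, params * h * w


# Illustrative layer (assumption): 3 x 3 kernel, 64 -> 128 channels, 32 x 32 map.
std_params, std_macs = standard_conv_cost(3, 64, 128, 32, 32)
sep_params, sep_macs = depthwise_separable_cost(3, 64, 128, 32, 32)

# The ratio equals 1/c_out + 1/k**2, independent of c_in and the map size.
reduction = sep_params / std_params
```
      </preformat>
      <p>For a 3×3 kernel the ratio is 1/c_out + 1/9, so the saving approaches a factor of nine as the layer gets wider.</p>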
      <p>
        Squeeze-and-Excitation (SE) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] blocks are a form of channel-wise attention to adaptively
recalibrate feature maps. An SE block “squeezes” global spatial information into a channel
descriptor (using global average pooling), then “excites” each channel with a learned weight to
emphasize informative features and diminish weak ones [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
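      <p>The squeeze and excitation steps can be sketched in plain Python on nested lists. This is a minimal illustration rather than the implementation used in the experiments: the helper names and the weight matrices w1, b1, w2, b2 are hypothetical, and a real SE block would learn these weights inside a deep-learning framework.</p>
      <preformat>
```python
import math


def squeeze(feature_maps):
    """Global average pooling: one scalar descriptor per channel."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]


def dense(vector, weights, biases):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def excite(descriptor, w1, b1, w2, b2):
    """Bottleneck FC + ReLU, then FC + sigmoid, giving one gate per channel."""
    hidden = [max(0.0, h) for h in dense(descriptor, w1, b1)]
    return [1.0 / (1.0 + math.exp(-s)) for s in dense(hidden, w2, b2)]


def se_block(feature_maps, w1, b1, w2, b2):
    """Rescale every channel by its learned gate."""
    gates = excite(squeeze(feature_maps), w1, b1, w2, b2)
    return [[[g * value for value in row] for row in fmap]
            for fmap, g in zip(feature_maps, gates)]


# Two 2 x 2 channels and a reduction ratio of 2 (one hidden unit).
x = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]]
# Assumption for the demo: all-zero weights, so every gate is sigmoid(0) = 0.5.
y = se_block(x, w1=[[0.0, 0.0]], b1=[0.0], w2=[[0.0], [0.0]], b2=[0.0, 0.0])
```
      </preformat>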
      <p>Inception Modules: A basic Inception module performs parallel convolutions of different sizes
(e.g. 1×1, 3×3, 5×5) and pooling on the same input, then concatenates their outputs. Importantly,
1×1 convolutions are used within the module for dimension reduction (i.e. bottlenecking) before
the more expensive 3×3 and 5×5 convs, drastically reducing the computational burden.</p>
      <p>Self-Calibration Convolution (SCConv) Block: Self-Calibration Convolution (SCConv) is a more
recent structural unit that aims to reduce feature redundancy in CNNs to improve efficiency.
SCConv explicitly factorizes a convolution into two cooperative parts: a Spatial Reconstruction
Unit (SRU) to handle spatial redundancy, and a Channel Reconstruction Unit (CRU) to handle
channel redundancy. The SRU “separates and reconstructs” feature maps – effectively a
transformation that processes different spatial parts and then recombines – while the CRU uses a
split-transform-fuse strategy on channels (somewhat analogous to group or depthwise convolution
but with learnable fusion).</p>
      <p>Densely Connected Layers: Densely Connected Convolutional Networks (DenseNets) feature
densely connected layers, where each layer receives as input all feature-maps from previous layers
(via concatenation). In a DenseNet block, layers are “densely” connected (in contrast to ResNet’s
additive identity connections) so that features are reused throughout the network. This
architecture encourages feature reuse and alleviates vanishing gradients, enabling very deep
networks to be trained efficiently. A key outcome of dense connectivity is that it achieves lower
error rates with significantly fewer parameters than traditional architectures. However, densely
connected layers come with some practical overhead. Because each layer concatenates all previous
outputs, the effective width (number of feature maps) grows throughout the network.</p>
      <p>Convolutional Block Attention Module (CBAM): CBAM is a lightweight attention module that
sequentially applies channel attention and spatial attention to a feature map. It can be regarded as
an extension of the SE block: first, CBAM computes a channel attention map, and applies it to the
features; then it computes a spatial attention map (using the channel-refined feature, by pooling
along channels and applying a convolution to find important spatial locations). CBAM yields a
boost in accuracy beyond what channel-only attention can provide.</p>
      <p>Other Common Structural Blocks: Residual Blocks (Skip Connections), Bottleneck Convolutions,
Group Convolutions and ResNeXt, Inverted Residuals (MobileNetV2/EfficientNet blocks), Spatial
Pyramid Pooling (SPP), etc.</p>
      <p>To empirically compare these blocks, we take a baseline CNN (e.g. a ResNet-like model) trained
on CIFAR-100 and evaluate the effect of adding each type of block (one at a time) to the network’s
architecture. The evaluation metrics are: Top-1 Accuracy delta on the CIFAR-100 test set, FLOPs
delta (forward-pass multiply-add operations for one image), Training Time, and Inference Latency
(single-image). For a fair comparison, each modified model is adjusted to have a similar depth so
that we isolate the effect of the block itself. Table 1 summarizes the qualitative results of this
comparative analysis, combining known findings from the literature with trends observed during
the CIFAR-100 experiments.</p>
      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>Effect of adding each structural block to the baseline CNN on CIFAR-100</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Structural block</th>
              <th>Top-1 accuracy delta</th>
              <th>FLOPs delta</th>
              <th>Training time</th>
              <th>Inference speed (ms/img)</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>Baseline (no special block)</td><td></td><td></td><td>4.5</td><td>7.5 (reference)</td></tr>
            <tr><td>Attention Mechanisms</td><td>+1.6%</td><td>+1.5%</td><td>4.6</td><td>7.6</td></tr>
            <tr><td>Depthwise Separable Conv</td><td>+0.92%</td><td>-2.3%</td><td>3.0</td><td>3.8</td></tr>
            <tr><td>Squeeze-and-Excitation (SE)</td><td>+2.37%</td><td>+0.52%</td><td>4.6</td><td>7.5</td></tr>
            <tr><td>Inception Module</td><td>+1%</td><td>+1.3%</td><td>3.8</td><td>6.5</td></tr>
            <tr><td>Self-Calibrated Conv (SCConv)</td><td>+1.07%</td><td>-6.75%</td><td>3.5</td><td>5.2</td></tr>
            <tr><td>Densely Connected (DenseNet)</td><td>+1.65%</td><td>+1.7%</td><td>5.4</td><td>8.3</td></tr>
            <tr><td>Conv. Block Attention (CBAM)</td><td>+0.56%</td><td>+7.94%</td><td>4.55</td><td>7.6</td></tr>
            <tr><td>Grouped Conv (ResNeXt)</td><td>+1.5%</td><td>+1.4%</td><td>3.7</td><td>6.9</td></tr>
            <tr><td>Bottleneck Conv</td><td>+0.1%</td><td>+1.1%</td><td>3.2</td><td>5.4</td></tr>
            <tr><td>Inverted Residuals</td><td>-0.3%</td><td>+0.2%</td><td></td><td>3.9</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>From the above comparisons, several general trends emerge. First, certain blocks primarily
target accuracy gains by enhancing the network’s representational power (e.g. attention modules,
dense connectivity), while others primarily target efficiency (e.g. depthwise separable conv,
SCConv, bottlenecks), and a few manage to achieve both (e.g. Inception, ResNeXt’s grouped conv,
SCConv to some extent). For instance, attention-type blocks (SE, CBAM) consistently improved
accuracy on CIFAR-100 by focusing on important features, with SE giving ~1-2% reduction in error
for almost no cost. Depthwise separable convolutions and inverted residuals showed massive
efficiency gains – our analysis agrees with the MobileNet results that you can shrink a model’s
FLOPs by an order of magnitude and still get respectable accuracy. On CIFAR-100, using depthwise
separable conv allowed the model to be very small and fast, though a slight accuracy drop had to be
compensated by using more filters or layers.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Proposed structural-parametric synthesis approach of HCNN using evolutionary algorithm utilizing CNN structural blocks</title>
      <p>We propose a multi-criteria evolutionary algorithm for the structural-parametric synthesis of a
hybrid CNN, which simultaneously optimizes the network’s architecture and certain
hyperparameters. The algorithm is based on a Genetic Algorithm (GA) framework enhanced with
multi-objective selection and hybrid training. The main components of the approach are:</p>
      <p>• Representation (Encoding): Each individual in the population encodes a CNN
architecture along with associated hyperparameters. We use a variable-length encoding to
allow flexible network depths. An individual’s “gene” is represented as a sequence of
layer descriptors. Each descriptor encodes a layer type (convolution, pooling, dense,
or special blocks like residual block, SE block, etc.), the associated parameters for that layer (e.g.
filter size, number of filters, stride, activation function), and connection information where
applicable (for example, whether a skip connection is applied). Additionally, we include
global hyperparameters such as initial learning rate or regularization factor as part of the
genome, so the algorithm can tune them.</p>
      <p>• Initial Population: The GA starts with an initial population of 25 randomly
generated CNN architectures. Each is created by random sampling of layer types and
hyperparameters under certain constraints (such as a minimum and maximum network
length). The randomness injects diversity; for example, one initial individual might be a
shallow conventional CNN, while another might randomly include a residual block or an
LSTM layer (if exploring temporal hybrid models). This diverse start helps cover different
regions of the search space.</p>
      <p>• Fitness Evaluation: Each individual (CNN architecture) is decoded into a network
model, which is then trained on the task data (gesture images) for a certain number of
epochs (or until convergence) to obtain its performance metrics. We evaluate each model
on a validation set to measure multiple criteria: accuracy, computational cost, memory
usage, and training time. The fitness function used for CNN evaluation is presented in
Section 5.1.</p>
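      <p>The encoding and random initialization described in the components above can be sketched as follows. This is a minimal illustration under stated assumptions: the search space in LAYER_CHOICES and GLOBAL_CHOICES, the length bounds, and all function names are hypothetical; only the population size of 25 and the idea of sampling layer types and hyperparameters under length constraints come from the text.</p>
      <preformat>
```python
import random

# Hypothetical search space: block type -> tunable parameters and their options.
LAYER_CHOICES = {
    "Conv": {"filters": [16, 32, 64], "kernel_size": [1, 3, 5],
             "activation": ["relu", "elu"]},
    "Pooling": {"pool_size": [2], "pool_type": ["max", "avg"]},
    "SE-Block": {"reduction_ratio": [4, 8, 16]},
    "Residual": {"filters": [16, 32, 64], "skip": [True]},
}
# Hypothetical global hyperparameters carried in the genome.
GLOBAL_CHOICES = {"learning_rate": [1e-2, 1e-3, 1e-4], "l2": [0.0, 1e-4]}


def random_layer(rng):
    """Sample one layer descriptor: a type plus a value for each of its parameters."""
    layer_type = rng.choice(sorted(LAYER_CHOICES))
    gene = {"type": layer_type}
    for name, options in LAYER_CHOICES[layer_type].items():
        gene[name] = rng.choice(options)
    return gene


def random_individual(rng, min_len=3, max_len=10):
    """Variable-length genome: a list of layer descriptors plus global settings."""
    depth = rng.randint(min_len, max_len)
    return {
        "layers": [random_layer(rng) for _ in range(depth)],
        "globals": {k: rng.choice(v) for k, v in GLOBAL_CHOICES.items()},
    }


def initial_population(size=25, seed=0):
    """Seeded so that a run can be reproduced."""
    rng = random.Random(seed)
    return [random_individual(rng) for _ in range(size)]


population = initial_population()
```
      </preformat>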
      <p>The evolutionary loop (evaluation → selection → crossover/mutation → next generation)
repeats for a number of generations until a stopping criterion is met. Because this is a multicriteria
optimization, we define the stopping condition in terms of either a target threshold for each
objective or a stability criterion. For instance, we may stop when the improvement in the Pareto
front over 5 generations is below a small epsilon (i.e., the search has converged to a stable set of
solutions), or simply after a preset maximum number of generations once the computational budget
is exhausted. At termination, the algorithm outputs the optimal architecture(s) found. When there
are multiple Pareto-optimal solutions, a user can then pick the specific CNN that best fits their
desired trade-off (e.g. highest accuracy within a given memory limit). In our case, we identified one
particular architecture that offers an excellent balance for gesture recognition and designate it as
the final optimal CNN.</p>
      <p><bold>5.1. Formulating a multi-objective fitness function for CNN model evaluation</bold></p>
      <p>Given the metrics that we use as evaluation criteria, we define a multi-criteria fitness
function. In this paper we propose a vector-based fitness function. The evaluation criteria are
represented as follows:
        <disp-formula><label>(1)</label><tex-math><![CDATA[\max_{x \in \Omega} \left[ f_1(x),\; -f_2(x),\; f_3(x),\; f_4(x) \right],]]></tex-math></disp-formula>
where each <italic>f</italic><sub><italic>n</italic></sub>(<italic>x</italic>) represents one criterion.</p>
      <p>To bring the criteria to the same scale, we perform normalization:
        <disp-formula><label>(2)</label><tex-math><![CDATA[z_i = \frac{f_i(x) - z_i^{\min}}{z_i^{\max} - z_i^{\min}}, \quad i = 1, 2, 3, 4,]]></tex-math></disp-formula>
where <italic>z</italic><sub><italic>i</italic></sub><sup>max</sup> and <italic>z</italic><sub><italic>i</italic></sub><sup>min</sup> are the worst and best values of each criterion in the current population.</p>
      <p>Next, we need to define the reference vectors. The reference vectors <italic>v</italic><sub><italic>j</italic></sub> are directions in the
<italic>M</italic>-dimensional objective space; they are distributed uniformly and chosen to
cover the entire objective space. Their number depends on the dimensionality of the space and the
desired density of the solution distribution. The vectors are chosen so that they correspond to a
uniform distribution of the desired trade-offs between the criteria.</p>
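      <p>The min-max normalization of Eq. (2) can be sketched as below. The function name, the example scores, and the handling of a criterion that is constant across the population (mapped to 0 to avoid division by zero) are assumptions made for illustration, not details given in the text.</p>
      <preformat>
```python
def normalize_objectives(population_scores):
    """Min-max normalise each criterion over the current population.

    population_scores holds one row per individual; every row lists the
    raw values f_1..f_M of that individual in the same order.
    """
    columns = list(zip(*population_scores))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in population_scores:
        z = []
        for value, low, high in zip(row, lows, highs):
            span = high - low
            # Assumption: a criterion with no spread in the population maps to 0.
            z.append((value - low) / span if span > 0 else 0.0)
        normalized.append(z)
    return normalized


# Hypothetical raw scores: [accuracy, GFLOPs, memory in MB, training minutes].
scores = [[0.90, 1.2, 40.0, 12.0],
          [0.95, 2.0, 55.0, 20.0],
          [0.85, 0.8, 30.0, 8.0]]
z = normalize_objectives(scores)
```
      </preformat>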
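      <p>The number of vectors is calculated by the formula:
        <disp-formula><label>(3)</label><tex-math><![CDATA[K = \binom{H + M - 1}{M - 1},]]></tex-math></disp-formula>
where <italic>H</italic> is the uniformity parameter (it determines the density of the vectors) and <italic>M</italic> = 4 is the number of criteria.</p>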
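      <p>The count in Eq. (3) is the number of points on a simplex lattice with H divisions per axis. Assuming that this lattice is the construction intended for the reference vectors (the text only states that they are uniformly distributed), the sketch below enumerates the points and checks the count; the function names and the choice H = 4 are illustrative.</p>
      <preformat>
```python
from itertools import combinations
from math import comb


def reference_vector_count(h, m):
    """Number of reference vectors for h divisions and m criteria, Eq. (3)."""
    return comb(h + m - 1, m - 1)


def simplex_lattice(h, m):
    """All points whose m coordinates are multiples of 1/h and sum to 1."""
    points = []
    # Stars and bars: place m - 1 bars among h + m - 1 slots; the gaps
    # between consecutive bars give the integer share of each coordinate.
    for bars in combinations(range(h + m - 1), m - 1):
        edges = (-1,) + bars + (h + m - 1,)
        counts = [edges[i + 1] - edges[i] - 1 for i in range(m)]
        points.append([c / h for c in counts])
    return points


vectors = simplex_lattice(h=4, m=4)
```
      </preformat>
      <p>With H = 4 and M = 4 this yields 35 reference vectors; increasing H gives a denser covering of the objective space at the cost of more scalar subproblems.</p>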
      <p>Next, for each solution <italic>z</italic><sub><italic>i</italic></sub> in the normalized objective space, we calculate the scalarized value.
Scalarization transforms the multi-objective problem into a number of scalar subproblems by means
of the reference vectors, and takes into account both convergence
to the Pareto front and uniform distribution of solutions in the objective space.</p>
      <p>Before calculating the fitness, each solution <italic>z</italic> (a vector of objective function values normalized
to eliminate scale differences) is associated with the nearest reference vector <italic>v</italic><sub><italic>j</italic></sub>. This provides a
link between the solution and the region of the objective space that this vector represents. The
closest vector is selected by projecting <italic>z</italic> onto the direction of each <italic>v</italic><sub><italic>j</italic></sub> and choosing the one that
minimizes the angle between them:
        <disp-formula><label>(4)</label><tex-math><![CDATA[j^{*} = \arg\max_{j} \cos\theta_j, \qquad \cos\theta_j = \frac{z \cdot v_j}{\lVert z \rVert \, \lVert v_j \rVert},]]></tex-math></disp-formula>
where cos <italic>θ</italic><sub><italic>j</italic></sub> is the cosine of the angle between the decision vector <italic>z</italic> and the reference vector <italic>v</italic><sub><italic>j</italic></sub>.</p>
      <p>A scalarized fitness function is used to evaluate the suitability of each solution with respect to
its reference vector <italic>v</italic><sub><italic>j</italic></sub>. It has two components:</p>
      <p>Convergence to the Pareto front: It is checked by projecting the solution onto the direction of
the vector <italic>v</italic><sub><italic>j</italic></sub>, which determines how close the solution is to the ideal point for this vector.</p>
      <p>Solution diversity: Evaluated by taking into account the distance between the solution <italic>z</italic> and its
projection on <italic>v</italic><sub><italic>j</italic></sub>, which contributes to an even distribution of solutions.</p>
      <p>The scalarized fitness function is:
        <disp-formula><label>(5)</label><tex-math><![CDATA[S(z, v_j) = v_j^{T} \cdot z + \alpha \cdot \lVert z \rVert \cdot \lVert v_j \rVert,]]></tex-math></disp-formula>
where <italic>v</italic><sub><italic>j</italic></sub><sup>T</sup> · <italic>z</italic> is the projection of the solution onto the reference vector, which reflects the
convergence to the Pareto front, ‖<italic>z</italic>‖ and ‖<italic>v</italic><sub><italic>j</italic></sub>‖ are the lengths of the vectors <italic>z</italic> and <italic>v</italic><sub><italic>j</italic></sub> that help to
estimate the difference between them, and <italic>α</italic> is an adaptive coefficient that increases over time,
changing the emphasis between convergence and diversity.</p>
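      <p>The association step of Eq. (4) and the scalarized fitness of Eq. (5) can be sketched as follows. The code follows Eq. (5) as printed; the helper names, the five example reference vectors, and the fixed value of α are assumptions made for illustration (the text states only that α increases over time, not its schedule).</p>
      <preformat>
```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(z, v):
    """Cosine of the angle between z and v; 0 if either vector has zero length."""
    denom = norm(z) * norm(v)
    return dot(z, v) / denom if denom > 0 else 0.0


def associate(z, reference_vectors):
    """Index of the reference vector with the largest cosine to z, Eq. (4)."""
    return max(range(len(reference_vectors)),
               key=lambda j: cosine(z, reference_vectors[j]))


def scalarized_fitness(z, v, alpha):
    """Eq. (5) as printed: projection term plus alpha times the product of norms."""
    return dot(v, z) + alpha * norm(z) * norm(v)


# Hypothetical reference set: the four axes plus the centre of the simplex.
refs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0],
        [0.25, 0.25, 0.25, 0.25]]
z = [0.4, 0.3, 0.2, 0.1]           # a normalised objective vector
j_star = associate(z, refs)        # the centre vector is closest in angle
fitness = scalarized_fitness(z, refs[j_star], alpha=0.5)
```
      </preformat>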
      <sec id="sec-5-1">
        <title>5.2. Defining an individual</title>
        <p>To define an individual that encapsulates a synthesized CNN architecture under the defined
criteria, we propose the following approach. The genome encapsulates the following properties:</p>
        <p>• number of layers;</p>
        <p>• types of layers/blocks (SCConv, SE-BE-Inc, Dense block, standard convolutional,
pooling, 1x1, batch normalization, etc.);</p>
        <p>• kernel sizes;</p>
        <p>• number of filters;</p>
        <p>• stride and padding;</p>
        <p>• activation functions;</p>
        <p>• block-related specific parameters;</p>
        <p>• learning rate, batch size, etc.</p>
        <p>As the encapsulation format we propose JSON, with a reference object-mapper
implementation. A simplified example of a genome in the unified JSON format is:</p>
        <preformat>[{"type": "ConvD", "filters": 64, "kernel_size": 2, "stride": 1, "padding": "same", "activation": "relu"},
 {"type": "Dense-Block", "num_layers": 3, "growth_rate": 12, "bottleneck_size": 4},
 {"type": "SE-Block", "reduction_ratio": 8},
 {"type": "SC-Conv", "filters": 32, "kernel_size": 2, "stride": 1, "padding": "same"},
 {"type": "Pooling", "pool_size": 2, "stride": 2, "pool_type": "Max-Pooling"},
 {"type": "FC", "units": 20, "activation": "softmax"}]</preformat>
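        <p>A genome in this format can be loaded and checked with the standard json module. The sketch below is only one possible reading of the "object mapper" idea: the REQUIRED_KEYS schema is hypothetical and was inferred from the single example above, and decode_genome is an illustrative name rather than part of the described implementation.</p>
        <preformat>
```python
import json

# The example genome from the text, reproduced verbatim.
GENOME_JSON = """
[{"type": "ConvD", "filters": 64, "kernel_size": 2, "stride": 1, "padding": "same", "activation": "relu"},
 {"type": "Dense-Block", "num_layers": 3, "growth_rate": 12, "bottleneck_size": 4},
 {"type": "SE-Block", "reduction_ratio": 8},
 {"type": "SC-Conv", "filters": 32, "kernel_size": 2, "stride": 1, "padding": "same"},
 {"type": "Pooling", "pool_size": 2, "stride": 2, "pool_type": "Max-Pooling"},
 {"type": "FC", "units": 20, "activation": "softmax"}]
"""

# Hypothetical schema: keys each block type must carry (inferred from the example).
REQUIRED_KEYS = {
    "ConvD": {"filters", "kernel_size", "stride", "padding", "activation"},
    "Dense-Block": {"num_layers", "growth_rate", "bottleneck_size"},
    "SE-Block": {"reduction_ratio"},
    "SC-Conv": {"filters", "kernel_size", "stride", "padding"},
    "Pooling": {"pool_size", "stride", "pool_type"},
    "FC": {"units", "activation"},
}


def decode_genome(text):
    """Parse a genome and reject unknown block types or missing parameters."""
    genes = json.loads(text)
    for position, gene in enumerate(genes):
        required = REQUIRED_KEYS.get(gene.get("type"))
        if required is None:
            raise ValueError(f"unknown block type at position {position}")
        missing = required - set(gene)
        if missing:
            raise ValueError(f"gene {position} is missing {sorted(missing)}")
    return genes


genome = decode_genome(GENOME_JSON)
block_types = [gene["type"] for gene in genome]
```
        </preformat>
        <p>Validating at decode time means that a malformed offspring produced by crossover or mutation is rejected before any training budget is spent on it.</p>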
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Experiment</title>
      <sec id="sec-6-1">
        <title>6.1. Settings and results</title>
        <p>Experimental Setup: We evaluated the proposed structural-parametric synthesis algorithm on a
real-world hand gesture image classification task. The dataset consists of a collection of hand
gesture images spanning 10 classes (such as numeric digits shown by fingers, or common sign
language letters), captured under varying backgrounds and lighting to mimic real-world
conditions. We used 80% of the data for training (with 20% of the training set held out as a
validation set for the algorithm’s fitness evaluation) and 20% for final testing. The evolutionary
algorithm was configured with a population size of 20 CNN architectures per generation, evolving
for up to 30 generations or until convergence. Each CNN candidate was trained for only 5 epochs on
the training set to obtain its validation accuracy (this truncated training was sufficient to gauge
relative performance). The optimization criteria and their weights were set as follows: accuracy
(40%), FLOPs (20%), parameter count (15%), memory usage (15%), and training time (10%) –
reflecting an emphasis on accuracy while still strongly penalizing resource-heavy models. All
experiments were run on a workstation with an NVIDIA RTX GPU; for methods that required
training from scratch (e.g. baseline models), we ensured training conditions were similar for
fairness.</p>
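        <p>The setup above fixes the criterion weights and the data split but does not spell out how the weights are combined. The sketch below assumes a plain weighted sum over criteria that are already normalized to [0, 1] and oriented so that higher is better; this combination rule, the dictionary keys, and the function names are assumptions, not the authors' stated method.</p>
        <preformat>
```python
# Criterion weights as reported in the experimental setup.
WEIGHTS = {"accuracy": 0.40, "flops": 0.20, "params": 0.15,
           "memory": 0.15, "train_time": 0.10}


def weighted_score(normalized):
    """Weighted sum of normalised criteria (assumed: 1.0 is best for each)."""
    return sum(WEIGHTS[name] * normalized[name] for name in WEIGHTS)


def split_fractions(train_share=0.8, val_share_of_train=0.2):
    """Share of the full dataset used for training, validation and testing."""
    val = train_share * val_share_of_train
    return {"train": train_share - val, "val": val, "test": 1.0 - train_share}


fractions = split_fractions()   # about 64% train, 16% validation, 20% test

# Hypothetical candidate: strong accuracy, middling cost figures.
candidate = {"accuracy": 0.95, "flops": 0.60, "params": 0.70,
             "memory": 0.65, "train_time": 0.50}
score = weighted_score(candidate)
```
        </preformat>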
        <p>Convergence Behavior: The proposed GA rapidly converged to high-performing
architectures. Figure 2 illustrates the accuracy change over generations, plotting the best
individual’s validation accuracy at each generation. We observe a steep increase in accuracy in the
early generations, as the GA quickly discovers better architectures than the random initial ones.
After about 10 generations, the improvement plateaus, and by generation ~50 the algorithm meets
the stopping criterion with only marginal gains beyond this point. This indicates convergence.
Notably, the best model’s accuracy approaches the theoretical maximum for the dataset, while
complexity metrics are simultaneously kept low through the multi-objective pressure. The
fluctuations in the average fitness diminish over time, showing the population becoming more
uniformly high-performing. The final chosen architecture emerged in generation 18 and
maintained top fitness thereafter (no further significant improvements in subsequent generations).
This convergence behavior demonstrates the efficiency of our approach in navigating the search
space – within a few dozen generations, it found a CNN structure that would be difficult to design
manually.</p>
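        <p>The text does not spell out the stopping criterion, so the following is only a sketch of one common choice: stop when the best validation accuracy has gained no more than a small threshold over a fixed number of generations, or when the generation limit is reached. The function names, the <italic>patience</italic> and <italic>min_delta</italic> parameters, and the accuracy history are hypothetical; the limit of 30 generations is taken from the experimental configuration.</p>
        <preformat>
```python
def has_plateaued(best_per_generation, patience=5, min_delta=0.001):
    """True if the best score gained at most min_delta in the last `patience` generations."""
    if patience >= len(best_per_generation):
        return False  # not enough history to judge yet
    earlier_best = max(best_per_generation[:-patience])
    recent_best = max(best_per_generation[-patience:])
    return min_delta >= recent_best - earlier_best


def first_plateau_generation(best_per_generation, max_generations=30,
                             patience=5, min_delta=0.001):
    """Return the 1-based generation at which the search would stop."""
    limit = min(len(best_per_generation), max_generations)
    for generation in range(1, limit + 1):
        if has_plateaued(best_per_generation[:generation], patience, min_delta):
            return generation
    return limit  # generation limit reached without a plateau


# Synthetic best-of-generation validation accuracies (not measured data).
synthetic_history = [0.62, 0.74, 0.83, 0.89, 0.93, 0.95, 0.96,
                     0.965, 0.966, 0.9662, 0.9663, 0.9663, 0.9664, 0.9664]

print("search would stop at generation", first_plateau_generation(synthetic_history))
```
        </preformat>
        <p>Comparing the running maximum rather than the latest value keeps the check stable when the best individual of a single generation fluctuates.</p>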
        <p>Performance on Test Set: On the held-out test set of gestures, the optimized CNN achieved a
classification accuracy of 98.7%, outperforming the baseline approaches we compare against.
The model’s inference time for a single image is 2.3 milliseconds on the RTX GPU (batch size 1),
which is fast enough for real-time use and suggests feasibility on embedded devices. To put
these results in context, we evaluated two reference models on the same data: (a) a standard
ResNet-18 model (11.7M parameters) trained on the gestures, which achieved 95.0% accuracy, and
(b) an EfficientNet-B0 model (about 5.3M parameters) with transfer learning, which achieved 97.5%
accuracy. Our evolved model not only surpasses the accuracy of ResNet-18 by a significant margin,
but does so with 85% fewer parameters and an order of magnitude fewer FLOPs,
demonstrating superior parameter efficiency. Compared to EfficientNet-B0, our model is slightly
more accurate and uses ~66% fewer parameters. These gains highlight the power of multi-criteria
optimization: the GA discovered architectural patterns (like combining an SE block with a custom
convolutional block) that yield high accuracy without bloating the model.</p>
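        <p>The two relative reductions quoted above can be cross-checked against each other. The short calculation below uses only the baseline parameter counts and percentages given in the text; the helper name is hypothetical, and the resulting figure is an implied estimate, not a reported parameter count for the evolved model.</p>
        <preformat>
```python
RESNET18_PARAMS_M = 11.7        # millions of parameters, from the text
EFFICIENTNET_B0_PARAMS_M = 5.3  # millions of parameters, from the text


def remaining_params(baseline_millions, reduction_fraction):
    """Parameter count (in millions) left after the stated relative reduction."""
    return baseline_millions * (1.0 - reduction_fraction)


from_resnet = remaining_params(RESNET18_PARAMS_M, 0.85)
from_efficientnet = remaining_params(EFFICIENTNET_B0_PARAMS_M, 0.66)

print(f"implied by the ResNet-18 comparison: {from_resnet:.2f}M")
print(f"implied by the EfficientNet-B0 comparison: {from_efficientnet:.2f}M")
```
        </preformat>
        <p>Both comparisons imply a model of roughly 1.8M parameters, so the two percentages are mutually consistent.</p>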
        <p>We compiled a comparison of our proposed method against other existing optimization
methods in Table 3. The table includes metrics for accuracy, training time, model size, complexity,
and number of generations to converge. All methods were evaluated or cited in the context of
achieving high accuracy on similar image classification tasks.</p>
        <p>The maximum error in the population decreased from 37% to 16% over 50 generations,
indicating that even the worst-performing models improved substantially. The midpoint error rate
also improved markedly, reflecting improvement across the population as a whole.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>We have presented a deep exploration into multicriteria optimization of CNN architectures,
culminating in a novel evolutionary structural-parametric synthesis algorithm for designing hybrid
CNNs. By reviewing recent NAS approaches and identifying the challenges in balancing accuracy
with efficiency, we motivated the need for a multi-objective solution. Our proposed GA-based
method integrates ideas from neuroevolution and modern CNN design to automatically discover
high-performance networks. In experiments on hand gesture classification, the method found an
architecture that outperforms manually designed and single-objective optimized networks in both
accuracy and resource usage. The key to this success is the multicriteria fitness evaluation:
considering accuracy, speed, and size simultaneously guides the search towards Pareto-optimal
models that traditional approaches might miss. The resultant hybrid CNN leverages advanced
building blocks (residual connections, SE attention) in a compact form, illustrating the power of
combining human-inspired design elements with automated search. Future work will extend this
approach to even more criteria (such as robustness to adversarial inputs) and to other domains like
video gesture recognition (where temporal dynamics add another layer of complexity). We believe
this research contributes a significant step toward automated, multi-objective deep learning model
design, enabling practitioners to obtain tailored neural networks that meet the precise needs of
real-world applications.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Sahoo</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Prakash</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pławiak</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Samantray</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Real-Time Hand Gesture Recognition Using Fine-Tuned Convolutional Neural Network</article-title>
          .
          <source>Sensors</source>
          <year>2022</year>
          ,
          <volume>22</volume>
          , 706. https://doi.org/10.3390/s22030706
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Zhichao</given-names>
            <surname>Lu</surname>
          </string-name>
          , Ian Whalen, Vishnu Boddeti, Yashesh Dhebar, Kalyanmoy Deb, Erik Goodman, Wolfgang Banzhaf,
          <article-title>NSGA-Net: Neural Architecture Search using Multi-Objective Genetic Algorithm</article-title>
          .
          <source>GECCO</source>
          <year>2019</year>
          . https://doi.org/10.48550/arXiv.1810.03522
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Han</given-names>
            <surname>Shi</surname>
          </string-name>
          , Renjie Pi, Hang Xu,
          <string-name>
            <given-names>Zhenguo</given-names>
            <surname>Li</surname>
          </string-name>
          , James T. Kwok, Tong Zhang,
          <article-title>Bridging the Gap between Sample-based and One-shot Neural Architecture Search with BONAS</article-title>
          .
          <source>NeurIPS</source>
          <year>2020</year>
          . https://doi.org/10.48550/arXiv.1911.09336
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. R. T.</given-names>
            <surname>Hax</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Penava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Krodel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Razova</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Buettner</surname>
          </string-name>
          ,
          <article-title>"A Novel Hybrid Deep Learning Architecture for Dynamic Hand Gesture Recognition,"</article-title>
          <source>in IEEE Access</source>
          , vol.
          <volume>12</volume>
          , pp.
          <fpage>28761</fpage>
          -
          <lpage>28774</lpage>
          ,
          <year>2024</year>
          , https://doi.org/10.1109/ACCESS.2024.3365274
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Barret</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Neural Architecture Search with Reinforcement Learning</article-title>
          .
          <source>Machine Learning (cs.LG)</source>
          .
          <year>2017</year>
          . https://doi.org/10.48550/arXiv.1611.01578
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wen</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>"SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy,"</article-title>
          <source>2023 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , Vancouver, BC, Canada,
          <year>2023</year>
          , pp.
          <fpage>6153</fpage>
          -
          <lpage>6162</lpage>
          , doi: 10.1109/CVPR52729.2023.00596.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jiang</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>"A Strength Pareto Evolutionary Algorithm Based on Reference Direction for Multiobjective and Many-Objective Optimization,"</article-title>
          <source>in IEEE Transactions on Evolutionary Computation</source>
          , vol.
          <volume>21</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>329</fpage>
          -
          <lpage>346</lpage>
          ,
          June
          <year>2017</year>
          , doi: 10.1109/TEVC.2016.2592479.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zgurovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sineglazov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chumachenko</surname>
          </string-name>
          ,
          <article-title>"Classification and Analysis of Multicriteria Optimization Methods,"</article-title>
          <source>in Artificial Intelligence Systems Based on Hybrid Neural Networks</source>
          , vol.
          <volume>904</volume>
          , pp.
          <fpage>59</fpage>
          -
          <lpage>174</lpage>
          , doi: 10.1007/978-3-030-48453-8_2.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>"Feedback Convolutional Neural Network for Visual Localization and Segmentation,"</article-title>
          <source>in IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          , vol.
          <volume>41</volume>
          , no.
          <issue>7</issue>
          , pp.
          <fpage>1627</fpage>
          -
          <lpage>1640</lpage>
          , 1 July
          <year>2019</year>
          , doi: 10.1109/TPAMI.2018.2843329.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Albanie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sun</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>"Squeeze-and-Excitation Networks,"</article-title>
          <source>in IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          , vol.
          <volume>42</volume>
          , no.
          <issue>8</issue>
          , pp.
          <fpage>2011</fpage>
          -
          <lpage>2023</lpage>
          , 1 Aug.
          <year>2020</year>
          , doi: 10.1109/TPAMI.2019.2913372.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>