<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Diagnostics of Children's Emotional State Based on Intellectual Multimodal Analysis of Drawings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mykola Korablyov</string-name>
          <email>mykola.korablyov@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stanislav Dykyi</string-name>
          <email>stanislav.dykyi@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleksandr Fomichov</string-name>
          <email>oleksandr.fomichov@nure.ua</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kharkiv National University of Radio Electronics</institution>
          ,
          <addr-line>Kharkiv 61166</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Simon Kuznets Kharkiv National University of Economics</institution>
          ,
          <addr-line>Kharkiv 61166</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>and Igor Kobzev</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The emotional state of a child is a complex, multidimensional construct, reflected in the choice of color, composition, symbolic images, and strokes in the drawing, which is formed through a non-linear, chaotic creative process. Traditional psychological analysis of children's drawings relies on subjective interpretation and is not scalable for mass screening. This paper proposes a neural network multimodal hybrid model for automated emotion diagnostics, combining four complementary feature channels. The pre-trained EfficientNet-B3 neural network extracts the global context of the image; the YOLOv8 neural network determines local semantically significant objects, expanded to 55 classes on the open ESRA dataset; the color palette is described by the statistics of the HSV (Hue, Saturation, Value) space; compositional and graphic metrics encode the geometry and character of the lines. For adaptive weighting of channel contributions, a lightweight attention-fusion layer is introduced, forming a 256-dimensional combined feature vector. The final classifier based on a multilayer perceptron (MLP) matches a drawing to one of three emotional categories - "Happiness", "Anxiety/Depression", "Anger/Aggression", achieving an accuracy of 80-85% on a combined test set from Kaggle. A key benefit is the interpretable JSON report, which contains class probabilities and numerical indicators of color, composition, and detected objects. This makes the results easier to use in practice by a psychologist and increases confidence in the model.</p>
      </abstract>
      <kwd-group>
        <kwd>children's drawings</kwd>
        <kwd>emotional state</kwd>
        <kwd>diagnostics</kwd>
        <kwd>neural network</kwd>
        <kwd>multimodal model</kwd>
        <kwd>EfficientNet-B3</kwd>
        <kwd>YOLOv8</kwd>
        <kwd>attention fusion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        A child's emotional well-being determines the trajectory of their cognitive, personal, and social
development, influencing academic success, the formation of self-esteem, and the quality of
interpersonal relationships [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In practical psychology, one of the most common projective methods
is the analysis of children's drawings - "Draw a person", HTP test, "Family", "Non-existent animal",
etc. It is assumed that the child unconsciously transfers experiences into the symbolism of the image,
allowing the specialist to identify happiness, anxiety, aggression or depressive tendencies [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
However, the interpretation of the drawings is based on the subjective experience of the psychologist
and is subject to inter-expert variability; during mass examinations in kindergartens, schools and
rehabilitation centers, the specialist is not able to quickly process hundreds of works [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Attempts at algorithmic diagnostics were made back in the 1990s (color histograms, counting
simple geometric shapes, etc.), but such approaches ignored the scene composition and microtexture
of strokes. With the development of deep learning, specialized convolutional neural networks (CNN)
have emerged that determine artistic styles [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], and single-stage YOLO-series detectors that allow
simultaneous localization of multiple objects of different scales [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Most existing studies focus on
only one aspect of the image: either the global style of the entire drawing [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], or the detection of individual objects
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>The lack of a comprehensive view leads to two problems:</p>
    </sec>
    <sec id="sec-2">
      <title>2. Analysis of existing research</title>
      <p>The methods of automatic analysis of children's drawings presented in the literature can be
conditionally divided into three directions:
1. Global classification of the entire image.
2. Local detection of semantic symbols.
3. Detection of combined/multimodal signs.</p>
      <p>
        For global classification of the entire image, typical Shallow CNN [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], ResNet-34 FT [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and
ResNet-50 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] models are used, in which the entire image is fed to the CNN classifier, which
immediately issues a label. Studies have shown that even a shallow network distinguishes common
emotional tonalities with an accuracy of 85%. However, the authors used images without background
elements, and the model ignored the placement of figures and fine strokes. Later, Two-Step FT
ResNet-34 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] achieved 99% accuracy on individual categories of the DAP test; however, the child's
emotion was inferred only indirectly from the presence of "house-man-tree" elements, without direct
mood recognition. The advantages of this direction are fast prototyping and no need for complex
annotation, while the disadvantages include insensitivity to local details and weak
interpretability.
      </p>
      <p>
        In the case of local detection of semantic symbols, a typical model of which is YOLOv8-cls [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],
the object detector finds objects (for example, "sun", "knife", etc.), and the conclusion is based on
the list of found objects. The advantage of this direction is high accuracy on "bright" markers
(weapons, tears); the disadvantages are that color and composition are ignored and that the neural
network remains a "black box" that outputs only the final class.
      </p>
      <p>
        When identifying combined/multimodal signs, isolated experiments are conducted with a color
histogram, and several channels (context, color, lines) are combined. Psychologists associate a dull
color with anxiety, torn lines with internal tension [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The palette and strokes are therefore informative, but there is no unified system
architecture, and the accuracy does not exceed 75 %. Early algorithms calculated HSV histograms or
contour density, but worked separately from the CNN. The combination of such features with deep
networks occurs only sporadically and does not give an increase of more than 5 % due to the lack of
a mechanism for "gluing" the channels together.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], a standard six-layer CNN is employed to automate the analysis of children's drawings;
its layers extract the semantic and sequential characteristics inherent in the drawings. The network
takes image pixel values directly as input, which makes it possible to extract abstract
information from children's drawings implicitly. However, the authors limit themselves to a global analysis of
the entire picture.
      </p>
      <p>Thus, each of the considered directions gives only partial information. Without local objects, the
neural network confuses "anger" and "happiness"; without color and compositional features, it is
difficult to distinguish "anxiety" from a "neutral" picture. In addition, most works do not produce
an explanatory report, so the psychologist has to trust a "black box".</p>
      <p>The aim of the work is to develop a multimodal hybrid neural network model that achieves high
accuracy, extracts the global context, local objects, color, and compositional-graphic metrics of a
drawing, and combines them using an attention-fusion mechanism. The key difference of the
proposed model is that it not only classifies children's drawings but also explains its decision and
supports the work of a psychologist in mass screening.</p>
      <p>Accordingly, the following tasks were set in the work:
1. Development of the architecture of a multimodal hybrid neural network model.
2. Obtaining the global context of the image: balance of spots, density of strokes, distribution of
color, etc.
3. Extraction of a set of image objects and their local features.
4. Extraction of color and compositional features that determine the emotional range, geometry,
and nervousness of the lines of a child's drawing.
5. Conducting experimental studies.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Architecture of a multimodal hybrid diagnostic neural network model</title>
      <p>A multimodal hybrid architecture is proposed that combines four sources of image information:
1. Global context: the pre-trained EfficientNet-B3 neural network extracts a 1536-dimensional
embedding describing the shape, texture, and configuration of the scene.
2. Local semantic objects: the YOLOv8-n neural network detects 55 classes from the ESRA
dataset and aggregates them into six features corresponding to three emotional classes.
3. Color palette: the statistics of the HSV (Hue, Saturation, Value) space (15-dimensional vector)
quantitatively reflect brightness, saturation, and dominant hues that correlate with emotional
tone.
4. Compositional and graphic features: geometric metrics of the arrangement of figures,
contour chaos, and density of strokes (9-dimensional vector) capture characteristic
patterns of anxiety and aggression.</p>
      <p>For adaptive fusion of heterogeneous features, a lightweight attention-fusion layer is introduced,
which forms a 256-dimensional fusion vector. Next, the multilayer perceptron (MLP) performs the
final classification of the drawing into one of three emotional categories: "Happiness",
"Anxiety/Depression", or "Anger/Aggression".</p>
      <p>
        In addition to prediction, the system generates a structured JSON report that includes: 1) class
probabilities; 2) attention α-weights showing the contribution of channels; 3) quantitative indicators of color,
composition, and detected objects. Such a report makes the model's output transparent and suitable
for discussion with parents and teachers [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        Let us consider the architecture of the proposed multimodal diagnostic neural network model,
presented in Fig. 1. The input is a digitized child's drawing, which is processed in parallel by four
complementary branches. The global context of the image is extracted by the pre-trained
EfficientNet-B3 neural network [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], forming a high-dimensional embedding that reflects large
shapes and textures of strokes. Using EfficientNet-B3 instead of a conventional CNN increases
accuracy on the global background by 8-10 %, requires fewer parameters than ResNet-50, and captures
the texture of strokes without modifying the code.
      </p>
      <p>
        Simultaneously, the simplified YOLOv8-n localizes 55 semantically significant object classes
(ESRA) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], such as sun, knife, tears, etc., and aggregates them into six quantitative features
corresponding to three emotional classes. Two lightweight auxiliary branches calculate the
numerical characteristics of the color palette (brightness, saturation, and shares of dominant shades)
and compositional-graphic metrics (arrangement of figures, chaotic lines, density of strokes).
YOLOv8-n is the main detector of specific symbols (weapon, tears, sun) and makes it possible to
explain the decision through the detected objects. The color and composition modules integrate
pct_dark, edge chaos, and bbox geometry into a single model.
      </p>
      <p>In the next step, all four feature vectors are projected into a common 256-dimensional space and
fed into the attention-fusion layer. The attention mechanism adaptively determines how informative
each channel is for a particular image. For example, the attention layer makes it possible to resolve
the conflict between a bright red background and a rainbow plot.</p>
      <p>
        The resulting 256-dimensional vector is passed to a two-layer MLP classifier, where, after a linear
transformation, ReLU activation, and Dropout, the logits of three emotional categories are generated.
The Softmax function transforms them into probabilities that comprise the primary prediction of the
model. The α-weights of attention and the primary numerical features (color, composition, list of
objects) are output to the JSON report, which makes the model's decision
transparent to a practicing psychologist [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>Let us take a closer look at the datasets, the image preprocessing stages, and the
implementation of each of the listed branches.</p>
      <sec id="sec-3-1">
        <title>3.1. Datasets and Pre-processing</title>
        <p>
          To let the network learn both explicit emotional cues and
unconscious symbols, training data from three different sources were combined. Each of them covers
a different aspect of the task, which supports good generalization ability of the network. The
combined test set of Kaggle Children Drawings, consisting of 500 RGB scans, was used [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Each drawing was assigned a single label (happiness/anxiety/anger), and
EfficientNet-B3 was fine-tuned on these data. This step provides natural drawings with reliable emotion labels.
        </p>
        <p>
          On the open ESRA Annotation dataset, which contains 3012 RGB images [19], the YOLOv8-n
neural network was trained to identify local semantically significant objects (knife, tears, sun,
etc.), which provide the rich semantics of local symbols necessary for explainable conclusions. The
annotation covers 55 classes with bounding boxes, aggregated into three clusters [
          <xref ref-type="bibr" rid="ref19">20</xref>
          ]. Next, emotion labels and bounding boxes
of key objects were defined for an internal pilot set of 200 scans, which allows for external validation
of interpretability and tests the network's robustness to regional style, a different set of objects, and
mixed techniques.
        </p>
        <p>A single preprocessing pipeline was used. Each source scan $I$ undergoes active
region extraction, contrast equalization (CLAHE), and dual scaling:
<disp-formula><tex-math>X_g = \mathrm{resize}(X, 224), \qquad X_l = \mathrm{letterbox}(X, 640), \qquad (1)</tex-math></disp-formula>
where $X_g$ is the input to the global branch (EfficientNet-B3); $X_l$ is the input to the local branch
(YOLOv8-n); $\mathrm{resize}(X, s)$ resizes the image $X$ to $s \times s$ pixels without padding;
$\mathrm{letterbox}(X, s)$ resizes $X$ with the aspect ratio preserved and pads it to $s \times s$;
$X$ is the active region of the scan after removing white margins.</p>
        <p>The first tensor goes to EfficientNet-B3, the second one goes to YOLOv8-n without distortion of
proportions.</p>
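        <p>The dual scaling described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the white threshold of 245, the white padding value, and the nearest-neighbour interpolation are assumptions made for the example, and the CLAHE step is omitted.</p>
<preformat>
```python
import numpy as np


def active_region(img, white_thresh=245):
    """Crop an RGB scan to the tight box around its non-white pixels."""
    # A pixel counts as "ink" if at least one channel is darker than the threshold.
    mask = white_thresh > img.min(axis=2)
    if not mask.any():
        return img  # blank sheet: nothing to crop
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


def resize_nn(img, out_h, out_w):
    """Nearest-neighbour resize (assumed here to avoid extra dependencies)."""
    h, w = img.shape[:2]
    ys = np.arange(out_h) * h // out_h
    xs = np.arange(out_w) * w // out_w
    return img[ys][:, xs]


def resize_square(img, size):
    """Global branch input: stretch to size x size without padding."""
    return resize_nn(img, size, size)


def letterbox(img, size, pad_value=255):
    """Local branch input: keep the aspect ratio and pad to size x size."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))
    canvas = np.full((size, size, img.shape[2]), pad_value, dtype=img.dtype)
    top = (size - new_h) // 2
    left = (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resize_nn(img, new_h, new_w)
    return canvas
```
</preformat>
        <p>With these helpers, the two tensors would be obtained as resize_square(active_region(scan), 224) and letterbox(active_region(scan), 640).</p>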
        <p>Balancing and synthetic diversification were performed because the original sample was biased
towards "happiness". Stratified mini-batches (1:1:1) and a weighted loss function with class weights
$w_c = 1/\sqrt{N_c}$ were applied, where $N_c$ is the number of examples of class $c$. To prevent the
network from "remembering" unique pencil curls, four complementary augmentations were introduced,
including:
1. Horizontal reflection (p = 0.5): children's compositions often shift their left/right bias.
2. MixUp: $\tilde{x} = \lambda x_i + (1 - \lambda) x_j$, $\lambda \sim \mathrm{Beta}(0.2, 0.2)$, which forces the model to
interpolate between emotions rather than remember details.
3. CutOut (patch 16 × 16, p = 0.3): simulates paint spots, finger shadows, or sheet tears.</p>
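        <p>The inverse-square-root class weights and the Beta-distributed image interpolation can be illustrated with a short NumPy sketch. The function names are hypothetical, and mixing the one-hot labels with the same coefficient is an assumption taken from the usual MixUp formulation; the text above only specifies the image interpolation.</p>
<preformat>
```python
import numpy as np


def class_weights(counts):
    """Loss weight per class: one over the square root of the class size."""
    counts = np.asarray(counts, dtype=float)
    return 1.0 / np.sqrt(counts)


def mixup(x_i, y_i, x_j, y_j, rng, alpha=0.2):
    """Blend two samples with a coefficient drawn from Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    x = lam * x_i + (1.0 - lam) * x_j
    # Assumption: labels are mixed with the same coefficient as the images.
    y = lam * y_i + (1.0 - lam) * y_j
    return x, y, lam
```
</preformat>
        <p>For example, class sizes of 100, 25, and 4 give weights of 0.1, 0.2, and 0.5, so rarer classes contribute more to the loss.</p>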
        <p>After expansion, the training pool grew from 200 to 4400 images; class imbalance was reduced to
±3 %. Thus, the combination of three corpora simultaneously provides emotional labels, semantic
objects, and regional diversity. A single preprocessing pipeline oriented to the two branches ensures
tensor consistency across all downstream modules. Balanced augmentation not only increases the data
volume but also keeps the network from memorizing the individual manner of a
stroke. All this forms a solid foundation for the processing units.</p>
        <p>
          Preprocessing and normalization. From each RGB scan, we extract the active region: we remove
white margins and take a tight bounding box around the remaining non-white content. Unless stated
otherwise, all numeric features are standardized with z-scores using the training set (the same
mean and standard deviation are reused on validation/test). Global context (EfficientNet-B3): resize
the active region to 224×224, scale pixels to [0, 1], and normalize by ImageNet mean/std; use the
1536-dimensional pooled feature for fusion. Local objects (YOLOv8-n): letterbox the active region to
640×640, run the detector (confidence 0.25, NMS IoU 0.45), and group detections into three emotions
via the published mapping [20].
        </p>
        <p>
          For each group, record the count and the area fraction (sum of box areas divided by the area of
the active region), then standardize all six values. Color (HSV): convert the active region to HSV in
[0, 1], compute means and variances of H, S, V; proportions of warm/cool/dark pixels; and six equal
hue-bin masses (sum = 1), then standardize all 15 values. Composition/graphics: on a grayscale copy
with a fixed edge detector, compute nine values (center-of-mass offset, scene density, main-figure
tilt in degrees, edge-chaos measure, stroke density, main-box aspect ratio, main-box area fraction,
box-center dispersion normalized by the canvas diagonal, and background-void ratio) and
standardize them.
        </p>
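        <p>The 15 color values listed above (three means, three variances, three pixel proportions, and six hue-bin masses) can be computed along the following lines. This is a sketch under stated assumptions: the hue ranges used for "warm" and "cool" and the brightness threshold for "dark" are illustrative choices that the text does not specify, hue is averaged linearly rather than circularly, and the final z-score standardization is left out.</p>
<preformat>
```python
import numpy as np


def rgb_to_hsv(rgb):
    """Convert float RGB in [0, 1], shape (..., 3), to HSV in [0, 1]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    v = rgb.max(axis=-1)
    c = v - rgb.min(axis=-1)                      # chroma
    s = np.where(v > 0, c / np.where(v > 0, v, 1.0), 0.0)
    safe_c = np.where(c > 0, c, 1.0)              # avoid division by zero
    h = np.where(v == r, ((g - b) / safe_c) % 6.0,
                 np.where(v == g, (b - r) / safe_c + 2.0,
                          (r - g) / safe_c + 4.0))
    h = np.where(c > 0, h / 6.0, 0.0)
    return np.stack([h, s, v], axis=-1)


def color_features(img_uint8):
    """Return the 15 raw color values for an RGB uint8 image."""
    hsv = rgb_to_hsv(img_uint8.astype(float) / 255.0).reshape(-1, 3)
    h, v = hsv[:, 0], hsv[:, 2]
    # Assumed definitions: warm = reds to yellows, cool = greens to blues.
    warm = np.mean((1 / 6 > h) | (h >= 5 / 6))
    cool = np.mean(np.logical_and(h >= 1 / 3, 3 / 4 > h))
    dark = np.mean(0.3 > v)                       # assumed darkness threshold
    hist, _ = np.histogram(h, bins=6, range=(0.0, 1.0))
    hist = hist / hist.sum()                      # six hue-bin masses, sum = 1
    return np.concatenate([hsv.mean(axis=0), hsv.var(axis=0),
                           [warm, cool, dark], hist])
```
</preformat>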
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Global Context Branch: EfficientNet-B3</title>
        <p>
          The global impression of a child's drawing is not only the set of objects, but also the "atmosphere" of
the scene: the balance of spots, the density of strokes, the distribution of color. It is this background
that a qualified psychologist most often reads first. To bring the machine's gaze closer to the human's,
EfficientNet-B3 was chosen as a context extractor - an architecture that, with a moderate number of
parameters, demonstrates high accuracy on a wide range of visual tasks [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. EfficientNet-B3 uses
the compound scaling principle: simultaneous but consistent expansion of the depth d, channel width
w, and input resolution r, controlled by a single compound coefficient.
        </p>
        <p>In children's drawings, where dark pencil grooves coexist with watercolor spots, the wide
channels of the first layers of EfficientNet-B3 turned out to be critical: the model better distinguished
sparse strokes from blots without losing global shapes.</p>
        <p>To avoid "re-memorizing" bright examples (children like to repeat the same "sun-house" motif),
Dropout 0.30 was added before Global Pooling. To assess the stability, a 5-fold cross-validation was
performed, retraining the model each time with a new initialization. The standard deviation of F1
did not exceed ±1.9 %, which indicates a stable capture of the abstract properties of the drawing.</p>
        <p>The final 1536-dimensional vector serves as the "global eye" of the
attention-fusion layer. In combination with local objects, palette, and composition, it provides the
model with context: the emotion can be read even when the knife is hidden in the corner
or the red tones are smoothed out with watercolor.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Local Object Branch: YOLOv8-n</title>
        <p>
          Children's drawings often contain
marker symbols: the sun or rainbow in the upper corner, a tiny knife in the hands of a person, a
stream of tears on a face, etc. These details, despite their small size, turn out to be the strongest
emotional cues. To extract local features, we used YOLOv8-n, the smallest but still sensitive model
of the Ultralytics family, with ~6 M parameters and an input resolution of 640 × 640 [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>Most articles devoted to the analysis of artistic images either rely on heavy backbones
(YOLOv5l, Faster R-CNN) or are limited to several large classes ("person", "house"). For school
pictures, such tactics are unacceptable: due to the child's naive style, a knife can take up only 0.5%
of the area, and a monster can be drawn with a single red stroke. A practical experiment showed that
the YOLOv8-n model achieves mAP@0.5 = 0.934 on the ESRA validation set, where mAP@0.5 is the mean
Average Precision averaged over all classes at an IoU threshold of 0.5 (COCO convention). At the same
time, YOLOv8-n remains four to five times lighter than its closest competitors.</p>
        <p>The original ESRA Annotation corpus includes 55 categories (from "sun" to "blood_drop").
However, the psychologist is interested not in the fact of "sun" itself, but in its emotional code.
Therefore, after detection, we aggregate the classes into three supergroups corresponding to the
target emotions: $G_1$ = Happiness, $G_2$ = Anxiety/Depression, $G_3$ = Anger/Aggression. For each
group of objects, two invariant quantities are calculated: $n_k$, the number of objects, and $\alpha_k$, their
total bbox area normalized to the area of the drawing. We obtain a compact vector:
<disp-formula><tex-math>\mathbf{f}_{obj} = (n_1, \alpha_1, n_2, \alpha_2, n_3, \alpha_3) \in \mathbb{R}^{6}, \qquad (2)</tex-math></disp-formula>
which is then fed into the attention-fusion layer. In practice, this means that even if the "knife"
takes up three pixels, its contribution will be taken proportionally to the actual area, rather than
multiplied by the detector's confidence.</p>
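        <p>The aggregation of detections into the six-component object vector can be sketched in plain Python. The class-to-group table below is a hypothetical four-entry excerpt used only for illustration (the published 55-class mapping is not reproduced here), and the detection tuple layout is likewise an assumption.</p>
<preformat>
```python
GROUPS = ("happiness", "anxiety_depression", "anger_aggression")

# Hypothetical excerpt of a class-to-group table, for illustration only.
CLASS_TO_GROUP = {
    "sun": "happiness",
    "rainbow": "happiness",
    "tears": "anxiety_depression",
    "knife": "anger_aggression",
}


def aggregate(detections, img_w, img_h):
    """Turn detections into (n1, a1, n2, a2, n3, a3).

    detections: iterable of (class_name, x1, y1, x2, y2) in pixels.
    """
    area = float(img_w * img_h)
    counts = dict.fromkeys(GROUPS, 0)
    fractions = dict.fromkeys(GROUPS, 0.0)
    for name, x1, y1, x2, y2 in detections:
        group = CLASS_TO_GROUP.get(name)
        if group is None:
            continue  # class not assigned to any emotional group
        counts[group] += 1
        fractions[group] += max(0.0, x2 - x1) * max(0.0, y2 - y1) / area
    vector = []
    for group in GROUPS:
        vector += [counts[group], fractions[group]]
    return vector
```
</preformat>
        <p>Because each box contributes its area fraction rather than its confidence score, a tiny knife adds a proportionally tiny value to the anger group while still raising its object count.</p>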
        <p>The model was initialized with the public checkpoint COCO-128 and further trained for 300
epochs on ESRA. Despite the miniature size of the network, the classic YOLO training scheme was
preserved: SGD optimizer with an initial learning rate of 0.01 and momentum of 0.937, with cosine
decay down to $10^{-4}$. The key role was played by the out-of-the-box augmentations of Ultralytics:
Mosaic (100 %), a collage of four pictures that conveys the chaos of children's sketches well;
HSV-shift (±0.1 H, ±0.5 S, ±0.5 V), an imitation of neon markers and faded felt-tip pens; and
Copy-Paste (20 %), in which a random object is transferred to another scene, accustoming the detector
to "stickers". Along with the feature vector, the top 3 objects by area are saved and shown to the
psychologist in the report. Thus, the specialist immediately understands that the alarming verdict of
the model is due not to an abstract impression but to specific detected figures. This
reduces the barrier of trust in the system and saves time: there is no need to manually search for
symbols on each sheet.</p>
        <p>Thus, YOLOv8-n extracts tiny but semantically
charged objects, turns them into a compact 6-dimensional feature, and, together with color,
composition, and global context, enables the network to give an accurate and explainable diagnosis.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Color and Composition Feature Extractors</title>
        <p>Despite the expressiveness of individual symbols, the emotional subtext of a child's drawing is often
"sewn" into the palette and style of strokes. An anxious child prefers a gray-blue range, pushes figures
to the edges of the sheet, and outlines them with trembling, broken lines; an angry child floods the
scene with thick red and scatters objects chaotically. To capture these subtle hints quantitatively,
two lightweight but critically important feature channels were added: color and
compositional-graphic. After the image is transformed into HSV space, the Hue (H), Saturation (S),
and Value (V) channels are processed separately. The resulting color vector
$\mathbf{f}_{col} \in \mathbb{R}^{15}$ contains six hue-bin components, three summary indices
(proportions of warm, cool, and dark pixels), and six means and variances of H, S, and V.</p>
        <p>Next, the arrangement of the figures is determined. The coordinates of the centers of all the
"human" bboxes found by YOLO allow us to calculate the center of mass and its distance to the
geometric center of the sheet, as well as the density of the scene, the slope of the main figure, and
the chaotic nature of the contour. As a result, we obtain a composition vector
$\mathbf{f}_{comp} \in \mathbb{R}^{9}$.</p>
        <p>Both vectors $\mathbf{f}_{col}$ and $\mathbf{f}_{comp}$ are concatenated:
$\mathbf{f}_{cc} = [\mathbf{f}_{col}, \mathbf{f}_{comp}] \in \mathbb{R}^{24}$, and projected by their
matrix $W_{cc} \in \mathbb{R}^{24 \times 256}$ into a 256-dimensional space comparable to the
projections of the YOLO and EfficientNet branches. This allows attention-fusion to dynamically
increase the weight of color if the scene is glowing red, or of composition if the characters are
huddled in a corner, without manually specifying the coefficients.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Attention-Fusion Mechanism and Classifier</title>
        <p>The four feature channels described above provide different information about the drawing: the
global branch conveys the overall context, the object branch detects specific symbols, the color
module captures the emotional range, and the compositional module captures the nervousness of the
lines and geometry. It is impossible to say in advance which of the channels will be decisive on a
specific sheet: in one case, the palette points to anger, in another, a tiny knife in the corner
outweighs the rainbow background. Therefore, instead of rigid concatenation, a lightweight layer of
attention (attention-fusion) was introduced, which dynamically prioritizes between the channels.</p>
        <p>The resulting compressed vector $\mathbf{z} \in \mathbb{R}^{256}$ is fed to a two-layer classifier
(MLP) with one hidden layer, which is described by the equations:
<disp-formula><tex-math>\mathbf{h} = \mathrm{ReLU}(W_1 \mathbf{z} + \mathbf{b}_1), \quad \mathbf{h} \in \mathbb{R}^{128}, \qquad \mathbf{h}' = \mathrm{Dropout}_{0.2}(\mathbf{h}), \qquad \mathbf{l} = W_2 \mathbf{h}' + \mathbf{b}_2, \quad \mathbf{l} \in \mathbb{R}^{3}, \qquad (3)</tex-math></disp-formula>
where $\mathbf{h}$ is the hidden activations after the first layer; $W_1$ and $\mathbf{b}_1$ are the
weights and bias of the first fully connected (hidden) layer; $W_2$ and $\mathbf{b}_2$ are the weights
and bias of the output layer; $\mathbf{l}$ is the logits (unnormalized scores) for the three classes.</p>
        <p>The Softmax function transforms the logits $\mathbf{l}$ into class probabilities $p_c$:
<disp-formula><tex-math>p_c = \frac{e^{l_c}}{\sum_{k=1}^{3} e^{l_k}}, \qquad c \in \{\text{Happiness}, \text{Anxiety/Depression}, \text{Anger/Aggression}\}, \qquad (4)</tex-math></disp-formula>
where $l_c$ is the logit of class $c$.</p>
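        <p>The forward pass through fusion, classifier, and Softmax can be sketched in NumPy as below. The text does not give the attention formula, so the scoring used here (a tanh-activated dot product with a learned vector, normalized by Softmax over the four channels) is an assumption, as are the random placeholder weights and all variable names. Each of the four channels is projected separately, which yields the four scalar attention weights mentioned in the text.</p>
<preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
DIMS = {"global": 1536, "objects": 6, "color": 15, "composition": 9}
D, HIDDEN, CLASSES = 256, 128, 3

# Random placeholder parameters; a trained model would load learned values.
proj = {name: rng.normal(0.0, 0.02, (dim, D)) for name, dim in DIMS.items()}
score_w = rng.normal(0.0, 0.02, D)
W1, b1 = rng.normal(0.0, 0.02, (HIDDEN, D)), np.zeros(HIDDEN)
W2, b2 = rng.normal(0.0, 0.02, (CLASSES, HIDDEN)), np.zeros(CLASSES)


def softmax(x):
    """Numerically stable Softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def forward(features):
    """Return (class probabilities, per-channel attention weights)."""
    names = list(DIMS)
    # Project every channel into the common 256-dimensional space.
    P = np.stack([features[name] @ proj[name] for name in names])  # (4, 256)
    # Assumed scoring rule: one scalar per channel, normalized over channels.
    alpha = softmax(np.tanh(P) @ score_w)                           # (4,)
    z = alpha @ P                                                   # (256,)
    h = np.maximum(0.0, W1 @ z + b1)                                # ReLU
    logits = W2 @ h + b2          # dropout acts as identity at inference
    return softmax(logits), dict(zip(names, alpha))
```
</preformat>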
        <sec id="sec-3-5-1">
          <p>In addition to probabilities, the model stores the attention α-weights, four scalars indicating which branch
proved to be the main one; the top 3 YOLO objects, along with their area and quantity; summary
color indicators; and key compositional metrics. Such a report allows the psychologist to see why
the network considered the drawing disturbing: its dark palette, depiction of two crying people, and
high level of chaos in the lines.</p>
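          <p>A report of this kind could be serialized as in the sketch below. The field names and the example values are hypothetical, since the text describes the report contents but not its schema.</p>
<preformat>
```python
import json


def build_report(probs, attention, top_objects, color, composition):
    """Assemble an explanatory report as a JSON string.

    All keys are illustrative; the actual schema is not given in the text.
    """
    report = {
        "prediction": max(probs, key=probs.get),
        "probabilities": probs,
        "attention_weights": attention,
        "top_objects": top_objects[:3],   # at most three objects are kept
        "color": color,
        "composition": composition,
    }
    return json.dumps(report, indent=2)


example = build_report(
    probs={"Happiness": 0.10, "Anxiety/Depression": 0.72, "Anger/Aggression": 0.18},
    attention={"global": 0.2, "objects": 0.4, "color": 0.3, "composition": 0.1},
    top_objects=[{"class": "tears", "count": 2, "area_fraction": 0.03}],
    color={"pct_dark": 0.41},
    composition={"edge_chaos": 0.67},
)
```
</preformat>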
          <p>Thus, attention-fusion decides in real time which instrument
(context, object, color, or composition) sounds louder in the emotional symphony of the drawing,
and the MLP translates this ensemble into a quantitative diagnosis with a transparent explanation.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental studies</title>
      <p>This section shows step by step how the proposed system was trained and why each of the added
innovations resulted in an increase in quality. First, the reproducible setup is described, then
quantitative indicators, results of ablation experiments, and analysis of the attention model are
presented. A reproducible training configuration was implemented based on the EfficientNet-B3
neural network, which was fine-tuned for 10 epochs with the learning rate decaying along a cosine
schedule from $1 \times 10^{-4}$ to $1 \times 10^{-5}$. The YOLOv8-n neural network was trained for 300 epochs
on ESRA with Mosaic and Copy-Paste augmentations; by the 200th epoch, mAP@0.5 reached 0.90.
The attention-fusion layer and the subsequent two-layer MLP were trained for 15 epochs with an
initial learning rate of $1 \times 10^{-3}$, which defines the step size of weight updates, and a dropout rate
of 0.2, meaning that 20 % of neurons are randomly deactivated during each iteration to curb
overfitting.</p>
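      <p>The cosine decay of the learning rate can be written down explicitly. The sketch below uses the common cosine-annealing formula with per-epoch updates; whether the original training stepped per epoch or per batch is not stated, so that choice is an assumption.</p>
<preformat>
```python
import math


def cosine_lr(epoch, total_epochs, lr_max, lr_min):
    """Learning rate for a given epoch under cosine annealing."""
    # Progress runs from 0.0 at the first epoch to 1.0 at the last one.
    progress = epoch / max(1, total_epochs - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Schedule for the 10-epoch fine-tuning described above.
schedule = [cosine_lr(e, 10, 1e-4, 1e-5) for e in range(10)]
```
</preformat>
      <p>The schedule starts at the maximum rate, decreases slowly at first, drops fastest in the middle epochs, and flattens out again as it approaches the minimum rate.</p>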
      <p>Fig. 2 shows the train/val loss curves for fine-tuning EfficientNet-B3; they indicate that
there is no overfitting: by the 9th epoch, the difference between train and val loss does not exceed 0.05.</p>
      <p>Macro-F1 is obtained by first computing the F1-score (the harmonic mean of precision and recall)
for each class and then averaging these scores, so every class contributes equally regardless of its
prevalence in the data. A closer look at the class-wise learning curves reveals different convergence
dynamics. Happiness reaches its plateau within the first six epochs, confirming that bright colours
are an easily learned cue. Anxiety/Depression climbs more slowly because
the model must integrate a dull palette, off-centre figures, and the absence of cheerful objects before
it can decide. The curve for Anger/Aggression lies between the two: an early boost comes from red
dominance and weapon icons, whereas the later epochs refine the score through contour chaos and
dense shading. The complete v2 system outperformed the baseline v1. On the held-out test split, v2
achieved Accuracy = 0.845 ± 0.012 and macro-F1 = 0.838 ± 0.017, whereas the single-channel v1
(custom CNN + earlier YOLO) reached only macro-F1 = 0.693. The largest gain appears in
Anxiety/Depression (+0.16 F1) followed by Anger/Aggression (+0.11 F1). The confusion matrix in
Fig. 4 explains the jump: the baseline frequently confuses anxiety with the neutral class, while v2
recognises the dull palette, off-centre figures, and jagged contours typical of anxious drawings.</p>
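      <p>The macro-F1 definition given above (per-class F1, then an unweighted mean) corresponds to the following plain-Python sketch. The function name and the toy labels are illustrative and are not taken from the experiments.</p>
<preformat>
```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


CLASSES = ["Happiness", "Anxiety/Depression", "Anger/Aggression"]
y_true = ["Happiness", "Happiness", "Anxiety/Depression",
          "Anxiety/Depression", "Anger/Aggression", "Anger/Aggression"]
y_pred = ["Happiness", "Happiness", "Anxiety/Depression",
          "Happiness", "Anger/Aggression", "Anger/Aggression"]
score = macro_f1(y_true, y_pred, CLASSES)
```
</preformat>
      <p>In this toy case the per-class F1 values are 0.8, 2/3, and 1.0, so a single misclassified anxious drawing lowers the macro average as much as it would in a much larger class.</p>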
      <p>Class-wise precision-recall statistics reinforce this picture: Happiness scores precision 0.88 /
recall 0.87 thanks to its bright warm colours; Anxiety/Depression settles at 0.79 / 0.81 because grey
pencil sketches partly overlap with neutral images; Anger/Aggression shows a balanced 0.83 / 0.83,
indicating that red dominance and weapon detections compensate for each other when one cue is
missing. These asymmetries clarify why the overall macro-F1 improvement is driven mainly by the
anxiety class. To assess sensitivity, ROC curves were constructed and are shown in Fig. 5. It can be
seen that the AUC increased from 0.78 to 0.91 for Anxiety and from 0.86 to 0.94 for Anger, which is
important for school
screening. In some drawings, cues from different channels can disagree (e.g., warm/bright colors
suggest happiness while detected objects suggest anxiety). Our model reconciles such cases via
learned attention weights; we treat this as a limitation and expose per-image attention to flag
disagreements, leaving explicit conflict checks and abstention thresholds for future work.</p>
      <p>An ablation study of the roles of the branches of the proposed model was performed; the results
are shown in Fig. 6. To assess the contribution of the channels, each branch was disabled in turn:</p>
      <sec id="sec-4-1">
        <list list-type="order">
          <list-item><p>Removing YOLO reduced macro-F1 by 6 %.</p></list-item>
          <list-item><p>Excluding the color branch reduced it by 3 %.</p></list-item>
          <list-item><p>Removing the composition branch reduced it by 2 %.</p></list-item>
          <list-item><p>Replacing attention with simple concatenation reduced it by 4 %.</p></list-item>
        </list>
        <p>Taken together, the ablation bars in Fig. 6 illustrate how the channels interact. Disabling the color
branch hurts Anger most, because red hue is its earliest cue, yet the model still recovers two-thirds
of the loss from object and contour information. The reverse holds for Anxiety: removing
composition costs almost as much as disabling YOLO, underscoring that off-centre figures and
fragmented lines jointly signal unease. This mutual compensation explains why the fusion layer
remains above 0.77 macro-F1 even when any single channel is silenced.</p>
        <p>Fig. 6 shows that switching off any branch lowers macro-F1.
The evolution of mAP@0.5 for the YOLOv8-n detector during training was also examined, as shown in Fig. 7.</p>
        <p>It is evident that the mAP@0.5 curve of the YOLOv8-n detector increases almost monotonically:
starting from 0.15, it reaches 0.90 by about the two-hundredth epoch and then reaches a plateau at
about 0.93. After that, the accuracy fluctuates within 0.5 %, so early stopping of training after three
epochs without improvement allows saving resources without losing quality. Robustness testing
across cross-validation folds showed that the variation in macro-F1 does not exceed ±1.7 percentage
points, indicating that the model is robust to the
style of individual authors and does not critically depend on the random distribution of plots.</p>
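        <p>The early-stopping rule mentioned above (stop after three epochs without improvement) can be sketched as follows. This is a simplified illustration: the function name, the <italic>min_delta</italic> parameter, and the score history are assumptions introduced here, not the training code used for the detector.</p>
        <preformat>
```python
def early_stopping_epoch(scores, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop.

    Training stops once `patience` consecutive epochs fail to beat the best
    score by more than `min_delta`; if that never happens, the last epoch
    is returned.
    """
    best = float("-inf")
    stale = 0  # epochs since the last improvement
    for epoch, score in enumerate(scores, start=1):
        if score > best + min_delta:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(scores)


# Hypothetical mAP@0.5 values, not the curve from Fig. 7.
history = [0.15, 0.40, 0.70, 0.90, 0.93, 0.93, 0.929, 0.93, 0.931]
stop_epoch = early_stopping_epoch(history, patience=3)
```
        </preformat>
        <p>With this hypothetical history the rule triggers at epoch 8, one epoch before a marginal gain to 0.931. A small patience therefore saves compute once the curve has flattened, at the cost of possibly missing very small late improvements.</p>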
        <p>There are limitations, however. Collage and watercolor techniques make it difficult to find closed
[...] contrast backgrounds. Another vulnerability occurs with stylized drawings (for example, comic book
characters with large eyes and a predominance of red): the network tends to classify them as
aggressive. In addition, the same symbols (dark sky, small
figure in the corner) can be perceived differently in different environments.</p>
        <p>
          Despite the listed weaknesses, the model shows a stable result on typical pencil and felt-tip pen
drawings, and the JSON report with explanatory objects makes the
network's conclusions transparent to a practicing psychologist [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. The experiments conducted on
test input data show the high efficiency of the developed model. Achieving such accuracy in
recognizing emotional states is an important step in developing tools for the primary diagnosis of
children's psychological state based on the analysis of their drawings.
        </p>
        <p>Thus, during the experiments on the combined dataset, the proposed model outperformed the
basic single-modality CNN by 13 % macro-F1 and achieved an accuracy of 80-85 %, especially
enhancing the recognition of anxious-depressive drawings. Unlike black-box models, the proposed
solution provides a quantitative decoding of the factors underlying the decision, which is critical for
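the practical work of a child psychologist.</p>
        <p><bold>5. Conclusion</bold></p>
        <p>The paper proposes a multimodal hybrid neural network model for diagnosing a child's emotional state based
on intelligent analysis of a single drawing. Its key feature is that the system is not limited to one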
source of visual information, but synthesizes four complementary layers of information at once. The
pre-trained EfficientNet-B3 is responsible for the global context and captures the scene composition;
the lightweight version of YOLOv8 detects up to fifty-five classes of semantically significant objects, including
weapons, the sun, clouds, and other markers of affect; specialized modules extract color palette
statistics and compositional and graphic characteristics of lines. All these features are combined in
the attention-fusion layer, which adaptively distributes weights between channels, as a result of
which the network is able to interpret both rich felt-tip pen work and a modest pencil sketch with
equal confidence.</p>
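        <p>One way to picture the adaptive channel weighting described above is a softmax over per-channel relevance scores followed by a weighted sum. The sketch below is only an illustration under stated assumptions: the scoring by a dot product with a learned vector, the function names, and the 4-dimensional toy embeddings are hypothetical, and the actual fusion layer may differ.</p>
        <preformat>
```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(channels, score_vector):
    """Fuse equally sized channel embeddings with softmax attention weights.

    channels: dict mapping channel name to a feature vector (list of floats).
    score_vector: hypothetical learned vector; its dot product with each
    embedding gives that channel's relevance score.
    """
    names = list(channels)
    scores = [sum(w * x for w, x in zip(score_vector, channels[n])) for n in names]
    weights = softmax(scores)
    dim = len(score_vector)
    fused = [
        sum(weights[i] * channels[n][k] for i, n in enumerate(names))
        for k in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy 4-dimensional embeddings; the paper's fused vector is 256-dimensional.
channels = {
    "global": [0.2, 0.1, 0.0, 0.3],
    "objects": [0.9, 0.4, 0.1, 0.0],
    "color": [0.1, 0.8, 0.2, 0.1],
    "composition": [0.0, 0.2, 0.5, 0.4],
}
fused, weights = attention_fuse(channels, score_vector=[1.0, 0.5, 0.0, -0.5])
```
        </preformat>
        <p>The returned weights sum to one, so they can be reported per image to show which channel dominated a particular decision.</p>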
        <p>Experimental validation confirmed the practical value of the approach. On the
combined test set, including open Kaggle data and a closed collection from school and clinical
institutions, the classification accuracy for the three emotional categories reached 83-85%. The
increase was especially pronounced for Anxiety/Depression: the F1-measure
increased by 0.16 (16 percentage points) relative to the baseline system, which relied only on global textures and a limited
list of objects. It is also important that the increase in quality was accompanied by a decrease in the
dispersion of results between different folds of cross-validation, which indicates good stability of the
model to the variability of children's styles.</p>
        <p>Another significant advantage was the level of explainability. Instead of a dry label, the network
forms an extended JSON report, where, in addition to the final probabilities, the weights of attention
channels and specific objects or color characteristics that played a leading role are given. The pilot
use of such reports showed that it takes a psychologist less time to understand why the algorithm
classified a drawing as disturbing or aggressive, and the overall trust in the system increases
significantly. It is also important that the entire pipeline fits into 20 MB of weights and produces an
answer in less than three seconds on a regular laptop, which opens the way to mass screening directly
at school without the need to transfer images to external servers.</p>
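        <p>To make the structure of such a report concrete, the sketch below assembles a small JSON document with class probabilities, attention weights, detected objects, and colour statistics. All key names and values are hypothetical placeholders chosen for illustration; the published report example may use a different schema.</p>
        <preformat>
```python
import json


def build_report(probabilities, attention, objects, color_stats):
    """Assemble a report dict; all key names here are illustrative."""
    predicted = max(probabilities, key=probabilities.get)
    return {
        "predicted_class": predicted,
        "probabilities": probabilities,
        "attention_weights": attention,
        "detected_objects": objects,
        "color": color_stats,
    }


# Placeholder values, not output of the described model.
report = build_report(
    probabilities={"Happiness": 0.12, "Anxiety/Depression": 0.71, "Anger/Aggression": 0.17},
    attention={"global": 0.30, "objects": 0.35, "color": 0.20, "composition": 0.15},
    objects=[{"label": "cloud", "confidence": 0.88}],
    color_stats={"mean_saturation": 0.21, "mean_value": 0.43},
)
report_json = json.dumps(report, indent=2)
```
        </preformat>
        <p>Keeping the report as plain JSON means it can be stored next to the drawing and read without the model, which fits the on-device screening scenario.</p>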
        <p>At the same time, there are still some issues that require attention: the model is currently worse
at handling collages and watercolor fills, and does not take into account age and cultural differences
in symbolism. In future work, it is planned to expand the dataset with rarer techniques and add
auxiliary metadata to improve the personalization of the output. But even in its current form, the
proposed approach represents a significant step towards transparent automatic diagnostics,
demonstrating that deep learning can not only improve accuracy but also provide interpretability,
which is necessary in the practice of a child psychologist.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly to check grammar and spelling,
paraphrase and reformulate. After using this tool/service, the authors checked and edited the content
as needed and take full responsibility for the content of the publication.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>U.</given-names>
            <surname>Podobnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jerman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Selan</surname>
          </string-name>
          ,
          <article-title>Understanding analytical drawings of preschool children: the importance of a dialogue with a child</article-title>
          .
          <source>International Journal of Early Years Education</source>
          ,
          <volume>32</volume>
          (
          <issue>1</issue>
          ) (
          <year>2024</year>
          )
          <fpage>189</fpage>
          -
          <lpage>203</lpage>
          . https://doi.org/10.1080/09669760.2021.1960802.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.-Y.</given-names>
            <surname>Brandt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dandarova-Robert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cocco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vinck</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Darbellay</surname>
          </string-name>
          ,
          <source>When Children Draw Gods: A Multicultural and Interdisciplinary Approach to Children's Representations of Supernatural Agents</source>
          , Springer (
          <year>2023</year>
          ). https://doi.org/10.1007/978-3-030-94429-2.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Barriage</surname>
          </string-name>
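          ,
          <string-name>
            <given-names>D. K.</given-names>
            <surname>deSouza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zitter</surname>
          </string-name>
          , and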
          <string-name>
            <given-names>C.</given-names>
            <surname>Sarabu</surname>
          </string-name>
          ,
          <article-title>Drawing Play: A Content Analysis of Children's Drawings of Places Where They Like to Play</article-title>
          .
          <source>Children, Youth, and Environments</source>
          , vol.
          <volume>33</volume>
          , no.
          <issue>2</issue>
          (
          <year>2023</year>
          )
          <fpage>63</fpage>
          -
          <lpage>89</lpage>
          . https://doi.org/10.1353/cye.2023.a903098.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Lindsay</surname>
          </string-name>
          ,
          <article-title>Convolutional neural networks as a model of the visual system: past, present, and future</article-title>
          .
          <source>Journal of Cognitive Neuroscience</source>
          ,
          <volume>33</volume>
          (
          <year>2021</year>
          )
          <fpage>2017</fpage>
          -
          <lpage>2031</lpage>
          . doi: 10.1162/jocn_a_01544.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Badrulhisham</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Mangshor</surname>
          </string-name>
          ,
          <article-title>Emotion recognition using a convolutional neural network (CNN)</article-title>
          .
          <source>Journal of Physics: Conference Series</source>
          , vol.
          <volume>1962</volume>
          (
          <year>2021</year>
          )
          <fpage>012040</fpage>
          . doi: 10.1088/1742-6596/1962/1/012040.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Beltzung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pelé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Renoult</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Sueur</surname>
          </string-name>
          ,
          <article-title>Deep learning for studying drawing behavior: A review</article-title>
          .
          <source>Frontiers in Psychology</source>
          , vol.
          <volume>14</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          . https://doi.org/10.3389/fpsyg.2023.992541.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cocco</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Ceré</surname>
          </string-name>
          ,
          <article-title>Drawings of God(s)</article-title>
          .
          <source>In book: When Children Draw Gods</source>
          (
          <year>2023</year>
          )
          <fpage>213</fpage>
          -
          <lpage>244</lpage>
          . doi: 10.1007/978-3-030-94429-2_9.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Polsley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Powell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liew</surname>
          </string-name>
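          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Hammond</surname>
          </string-name>
          ,
          <article-title>Detecting children's fine motor skill development using machine learning</article-title>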
          .
          <source>International Journal of Artificial Intelligence in Education</source>
          ,
          <volume>32</volume>
          (
          <issue>4</issue>
          ) (
          <year>2021</year>
          )
          <fpage>991</fpage>
          -
          <lpage>1024</lpage>
          . doi: 10.1007/s40593-021-00279-7.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Pysal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Abdulkadir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R. M.</given-names>
            <surname>Shukri</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Alhussian</surname>
          </string-name>
          ,
          <article-title>Classification of children's drawing strategies on the touch-screen of seriation objects using a novel deep learning hybrid model</article-title>
          . (
          <year>2021</year>
          )
          <fpage>115</fpage>
          -
          <lpage>129</lpage>
          . doi: 10.1016/j.aej.2020.06.019.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>and chological Analysis using Shallow CNN</article-title>
          .
          <source>Proc. IEEE Int. Conf. on Artificial Intelligence and Big Data</source>
          , (
          <year>2020</year>
          )
          <fpage>69</fpage>
          -
          <lpage>73</lpage>
          . doi: 10.1109/ICAIBD49809.2020.9134510.
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M. O.</given-names>
            <surname>Zeeshan</surname>
          </string-name>
          , L. Liu, and
          <string-name>
            <given-names>M.</given-names>
            <surname>Pietikäinen</surname>
          </string-name>
          ,
          <article-title>Two-Step Fine-Tuned CNNs for Multi-label Recognition</article-title>
          . (
          <year>2021</year>
          )
          <fpage>120</fpage>
          -
          <lpage>131</lpage>
          . doi: 10.1109/ICDAR53238.2021.00025.
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hussain</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <article-title>AI-Based Mobile Application for Sensing Children's Emotion Through Drawings</article-title>
          .
          <source>Studies in Health Technology and Informatics</source>
          ,
          <volume>290</volume>
          (
          <year>2022</year>
          )
          <fpage>148</fpage>
          -
          <lpage>151</lpage>
          . doi: 10.3233/SHTI220432.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Jocher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chaurasia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Stoken</surname>
          </string-name>
          ,
          <article-title>Ultralytics YOLOv8: State-of-the-Art Real-Time Object Detection</article-title>
          .
          <source>arXiv preprint arXiv:2304.00555</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Alshahrani</surname>
          </string-name>
          , and
          <article-title>Model by Drawing Analysis Based on Computer Vision and Deep Learning</article-title>
          .
          <source>Engineering, Technology &amp; Applied Science Research</source>
          <volume>14</volume>
          (
          <issue>1</issue>
          ) (
          <year>2024</year>
          )
          <fpage>1082</fpage>
          -
          <lpage>1088</lpage>
          . doi: 10.48084/etasr.6183.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Korablyov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kobzev</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Dykyi</surname>
          </string-name>
          .
          <article-title>Diagnosis of the Child's Emotional State Based on the Intellectual Analysis of Children's Drawings</article-title>
          .
          <source>Proc. 19th Int. Conf. on Computer Science and Information Technologies</source>
          (
          <year>2024</year>
          ).
          doi: 10.1109/CSIT65290.2024.10982568.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <article-title>Diagnostics of children's emotional state based on intellectual multimodal analysis of drawings: JSON report example</article-title>
          . URL: https://zenodo.org/records/16917978. DOI: https://doi.org/10.5281/zenodo.16917978.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q.V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks</article-title>
          .
          <source>Proc. 36th Int. Conf. on Machine Learning</source>
          (
          <year>2019</year>
          )
          <fpage>6105</fpage>
          -
          <lpage>6114</lpage>
          . doi: 10.48550/arXiv.1905.11946.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>V.</given-names>
            <surname>Perera</surname>
          </string-name>
          , Children Drawings Dataset.
          <source>Kaggle</source>
          (
          <year>2024</year>
          ). URL: https://www.kaggle.com/datasets/vishmiperera/children-drawings. https://universe.roboflow.com/esra/esra-data-annotation.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [20]
          <article-title>Diagnostics of children's emotional state based on intellectual multimodal analysis of drawings: JSON ERSA dataset mapping</article-title>
          . URL: https://zenodo.org/records/16918307. DOI: https://doi.org/10.5281/zenodo.16918307.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>