<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Scalability analysis of neural network architectures for voice biometric authentication</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Khrystyna Ruda</string-name>
          <email>khrystyna.s.ruda@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmytro Sabodashko</string-name>
          <email>dmytro.v.sabodashko@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ihor Kos</string-name>
          <email>ihor.kos.mkbas.2025@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Opirskyy</string-name>
          <email>ivan.r.opirskyi@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alina Akhmedova</string-name>
          <email>alina.akhmedova.mkbas.2025@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>12 Stepan Bandera str., 79000 Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>349</fpage>
      <lpage>357</lpage>
      <abstract>
        <p>The rapid implementation of digital services in financial, governmental, and commercial domains is accompanied by a growing demand for reliable user identification tools. Voice-based biometric authentication systems represent one of the most promising directions, as they combine ease of use with the potential for integration across a wide range of services: from banking operations and call centers to voice assistants. At the same time, such systems face a number of challenges: the need to scale to tens of thousands of users, to ensure resilience against attacks employing synthetic speech, and to maintain high accuracy in real-time operation. Neural network architectures capable of generating robust embeddings and preserving stability as data volumes increase are of particular importance. This article examines the scalability challenges of modern voice biometrics models and substantiates approaches to enhancing their efficiency, including architectural optimization, indexed search, and the use of representative speech corpora. The study emphasizes the necessity of a comprehensive approach to the development of voice authentication systems, in which technical performance is combined with security requirements and protection against cyber threats.</p>
      </abstract>
      <kwd-group>
        <kwd>Voice biometrics</kwd>
        <kwd>scalability</kwd>
        <kwd>speaker verification</kwd>
        <kwd>embeddings</kwd>
        <kwd>authentication</kwd>
        <kwd>ECAPA-TDNN</kwd>
        <kwd>Pyannote</kwd>
        <kwd>WavLM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The scalability of a biometric system is defined by its ability to maintain efficiency as the number
of users and the volume of data increase. Biometric technologies—such as fingerprint, face, voice,
and iris recognition—must ensure high accuracy and rapid response even when operating with
millions of enrolled templates [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This requirement is particularly relevant for national
identification systems and voice assistants, which serve massive user bases daily, where
performance must not degrade under growing loads. In this context, scalability implies the ability
to continuously add new users without system reconstruction, while maintaining stable response
times [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2–4</xref>
        ]. Such requirements can be met through optimized search and indexing algorithms,
distributed computing methods, and cloud-based infrastructures [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. However, scaling biometrics
to large datasets introduces a number of challenges. First, computational costs rise sharply as the
user base expands, since 1:N identification requires millions of comparisons; without optimization,
this leads to unacceptable response times. Equally critical are latency and throughput issues, as
authentication must be performed in real time, even in large-scale scenarios such as call centers or
border control. Furthermore, data storage and transmission requirements increase, demanding
efficient handling of large biometric templates (fingerprints, images, embeddings) with compact
representation and fast access. Finally, accuracy must be preserved despite a growing number of
users, the presence of similar biometric patterns, and intra-user variability caused by recording
conditions, background noise, or age-related factors [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7–9</xref>
        ].
Therefore, scalability in biometric systems is a multifaceted characteristic that integrates
performance, accuracy, and security. It requires the application of robust models capable of
operating in real time while meeting stringent data protection requirements [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work and Literature Review</title>
      <p>The analysis of recent scientific sources demonstrates a significant growth of interest in scalable
solutions in the field of voice biometrics. Current research focuses on the development of
embedding models, architectural optimization, and the applied use of technologies in banking
services, call centers, and voice assistants.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the scalability of a voice authentication system based on the TitaNet architecture was
investigated. The author showed that the model maintained high efficiency on smaller datasets but
gradually degraded as the number of users increased. This highlights the need for adaptive
threshold calibration and additional methods to improve accuracy. Thus, even state-of-the-art
architectures reveal scalability limitations, underscoring the necessity of further optimization.
      </p>
      <p>
        A more comprehensive comparative study was presented by Brydinskyi et al. (2024) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], who
analyzed several modern speaker verification models, including ECAPA, TitaNet, WavLM, and
PyAnnote. The authors concluded that speaker embedding technology exhibits strong scalability,
as new users can be added without retraining the model. Particular attention was given to the
ECAPA architecture, which demonstrated a balanced trade-off between accuracy (EER ~1.7%) and
inference speed (~69 ms), making it a promising candidate for large-scale authentication systems.
      </p>
      <p>
        Continuing the topic of optimization, Thienpondt and Demuynck (2023) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] proposed the
hybrid ECAPA2 architecture, which enhances the robustness of speaker embeddings to noise and
short utterances. The model achieved state-of-the-art results on the VoxCeleb1 dataset with fewer
parameters compared to earlier approaches. This indicates the effectiveness of reducing
computational complexity without sacrificing performance—an important factor in system
scalability.
      </p>
      <p>
        Another relevant challenge is speaker identification from short speech fragments. Deng et al.
(2025) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] introduced the Dense-Fusion2Net architecture with a Time-Frequency Channel
Attention (TFCA) module, enabling efficient processing of signals of limited duration. The model
demonstrated strong performance on VoxCeleb datasets, preserving both high accuracy and low
computational cost. This makes it highly suitable for banking and service applications, where user
interaction lasts only a few seconds.
      </p>
      <p>
        A broader historical perspective on the evolution of technologies was provided by Sharma et al.
(2024) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The authors traced the transition from classical i-vector approaches to modern neural
embedding models, emphasizing that the emergence of x-vector and subsequent deep architectures
enabled the move toward scalable systems. Their review highlighted that contemporary methods
allow operation in open-set scenarios, where user databases expand continuously, aligning with
real-world application needs.
      </p>
      <p>
        The issue of security in scalable systems was addressed by Chen et al. (2023) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], who studied
attempts to deceive speaker recognition systems using specially crafted audio adversarial examples.
The authors demonstrated that combining audio signal transformations with adversarial training
improved accuracy by ~13.6% and significantly increased system resistance to such attacks. This
forms the basis for developing more secure and scalable biometric systems.
      </p>
      <p>
        Another important security aspect is privacy and heterogeneous training conditions. Chen &amp; Xu
(2023) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] proposed a personalized federated learning (PFL) approach for speaker verification and
identification tasks. Their framework preserved user data privacy, enabled adaptation to diverse
acoustic domains (rooms, noise conditions, languages), and avoided catastrophic forgetting in
continuous learning (C-PFL). Across 12 simulated scenarios, this approach achieved lower average
EER and better convergence compared to centralized training, highlighting its promise for scalable
real-world systems.
      </p>
      <p>
The practical dimension of scalability is illustrated by industrial studies. For example, in the
RudderAnalytics project (2024) [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], a system based on SpeechBrain with the VoxCeleb2 dataset
was deployed. The developers showed that the system maintained high accuracy under scaling,
successfully handling a 115% increase in registered users without performance degradation.
      </p>
      <p>
        Meanwhile, even classical architectures remain subject to optimization. Sharif-Noughabi et al.
(2025) [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] demonstrated that a modified VGG-CNN, combined with data augmentation techniques,
significantly improved identification accuracy on VoxCeleb1 (from ~84% to over 91%). This
indicates the effectiveness of integrating classical approaches with modern training techniques to
ensure performance on large user databases.
      </p>
      <p>In summary, the analysis of recent publications indicates that the scalability of voice biometric
systems relies on compact and robust embeddings, lightweight architectures, quantization
methods, and solutions for short speech recordings, complemented by security and privacy
mechanisms. Modern neural models are moving toward unifying high accuracy, optimal inference
speed, and resilience to attacks, thereby opening the way for large-scale deployment in banking
services, call centers, and voice assistants.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Research Methodology</title>
      <p>
        Highly accurate neural networks are often very large, which complicates their deployment on
devices with limited resources or under large-scale cloud access. To address the computational
challenges posed by such models, optimization methods are applied, in particular quantization,
which reduces the numerical precision of network weights and activations. For instance, Amazon reduced the Alexa voice
assistant model to less than 1% of its original size without significant loss of accuracy [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. This
was achieved through parameter quantization and knowledge distillation, where a smaller student
model learns from the outputs of a larger teacher model. Optimized models of this kind can be
deployed on smartphones, in-car systems, and even microcontrollers. At the same time, for speaker
identification, fast search in embedding databases is critical. Instead of exhaustive matching,
index-based methods are employed, enabling scalability to tens of millions of templates [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
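      <p>As a minimal illustration of the index-based search mentioned above, the following sketch partitions L2-normalised reference embeddings into coarse buckets and, at query time, probes only the most promising buckets instead of exhaustively scanning every template. All shapes, the bucket count, and the noise level are hypothetical; production systems typically rely on dedicated libraries such as FAISS.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    # Scale rows to unit length so the inner product equals cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# 10 000 enrolled reference embeddings of dimension 192 (illustrative sizes).
refs = normalize(rng.normal(size=(10_000, 192)))

# Coarse index: assign every reference to its nearest of k centroids.
k = 32
centroids = normalize(refs[rng.choice(len(refs), k, replace=False)])
assignment = np.argmax(refs @ centroids.T, axis=1)
buckets = {c: np.flatnonzero(assignment == c) for c in range(k)}

def search(query, n_probe=4):
    # Scan only the n_probe most promising buckets, not all 10 000 templates.
    q = normalize(query)
    probe = np.argsort(q @ centroids.T)[::-1][:n_probe]
    cand = np.concatenate([buckets[int(c)] for c in probe])
    scores = refs[cand] @ q
    return int(cand[np.argmax(scores)]), float(scores.max())

# A slightly noisy copy of enrolled speaker 1234 should be retrieved.
query = refs[1234] + 0.01 * rng.normal(size=192)
idx, score = search(query)
print(idx, round(score, 3))
```

      <p>Probing a handful of buckets trades a small recall risk for a large reduction in comparisons, which is the essence of scaling 1:N identification beyond exhaustive matching.</p>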
      <p>Another critical component of scalable biometric systems is training data. The quality, size, and
representativeness of training corpora directly influence model performance and adaptability. To
ensure reliable operation across diverse conditions, models must be trained on large and
heterogeneous datasets. Representativeness is especially important: datasets should cover a wide
variety of accents, languages, and user groups to avoid bias and accuracy drops for
underrepresented populations. For example, voice services trained predominantly on a single
demographic may perform poorly for other groups with distinct accents, timbres, or vocabulary.
Thus, investment in the collection and annotation of balanced, diverse training data is fundamental
for scalable and universal AI systems.</p>
      <p>Scalability is also inseparable from reliability. In high-availability services, such as banking
voice platforms or consumer voice assistants, models must operate continuously (24/7). This
requires deployment in clustered environments with effective load balancing across nodes. Such
architectures prevent bottlenecks, ensure stable performance under peak loads, and support
automatic failover: if one node fails, others seamlessly take over the workload. This approach
guarantees service continuity, fault tolerance, and minimizes downtime risks—critical for online
services, financial platforms, healthcare applications, and other domains where even brief
disruptions can result in financial losses or reduced user trust.</p>
      <p>Accordingly, system scalability encompasses not only the ability to process increasing data
volumes and user requests, but also the assurance of robustness, fault tolerance, and uninterrupted
service availability under real-world conditions.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Dataset Description</title>
      <p>
        For the experimental study of neural model scalability in voice biometric authentication systems, a
multilingual audio dataset was constructed, including both Ukrainian and English speakers [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
The Ukrainian portion of the dataset consisted of recordings from 50 public figures (primarily
politicians, government officials, and communicators), collected from open sources. As the main
source, we used a public dataset hosted on the Hugging Face platform [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], which provided
structured Ukrainian audio files of sufficient quality for the research. Each speaker was represented
by 10 recordings with an average duration of approximately 10 seconds, ensuring an adequate
amount of data for generating reference embeddings.
      </p>
      <p>To evaluate model scalability under conditions of linguistic variability, the dataset was
supplemented with recordings of 20 English-speaking individuals, including well-known actors,
journalists, and public figures. These audio materials were manually collected from open sources,
primarily video interviews published on YouTube.</p>
      <p>To prevent overlap between training and testing examples, an extended dataset was prepared.
Specifically, for each of the 70 speakers, an additional 20 unique audio recordings were created,
which did not duplicate the samples used for constructing reference embeddings. This ensured the
correctness of verification accuracy evaluation and the validity of experimental results.</p>
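      <p>The speaker-wise split described above can be sketched as follows. The speaker labels and file names are synthetic placeholders; only the counts (70 speakers, 10 enrollment recordings and 20 held-out test recordings each) follow the text.</p>

```python
def split_speaker_files(files_by_speaker, n_enroll=10, n_test=20):
    """Split each speaker's recordings into disjoint enrollment/test sets."""
    enroll, test = {}, {}
    for spk, files in files_by_speaker.items():
        if len(files) < n_enroll + n_test:
            raise ValueError(f"speaker {spk} has too few recordings")
        enroll[spk] = files[:n_enroll]                       # build references
        test[spk] = files[n_enroll:n_enroll + n_test]        # held out for testing
    return enroll, test

# Synthetic corpus: 70 speakers with 30 recordings each.
corpus = {f"spk{i:02d}": [f"spk{i:02d}_{j:02d}.wav" for j in range(30)]
          for i in range(70)}
enroll, test = split_speaker_files(corpus)
print(len(enroll), len(enroll["spk00"]), len(test["spk00"]))  # prints: 70 10 20
```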
    </sec>
    <sec id="sec-5">
      <title>5. System Architecture and Verification Methodology</title>
      <p>In the first stage of the study, all audio files belonging to a single speaker were processed by each
of the selected neural networks: Pyannote, WavLM base-sv, WavLM base-plus-sv, and ECAPA. As
a result, 10 embeddings were generated and then averaged into a single vector—the speaker’s
reference embedding. Averaging reduces the impact of natural voice variations (intonation, tempo,
background noise). The resulting reference embeddings were stored in a database together with a
unique speaker identifier.</p>
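      <p>The averaging step can be sketched as follows; the embedding dimension of 512 and the random vectors are illustrative assumptions, since each model emits vectors of its own size.</p>

```python
import numpy as np

def reference_embedding(embeddings):
    # Average the per-recording embeddings, then re-normalise so that
    # later cosine comparisons are well scaled.
    mean = np.mean(embeddings, axis=0)
    return mean / np.linalg.norm(mean)

# 10 synthetic per-recording embeddings for one speaker.
recordings = np.random.default_rng(1).normal(size=(10, 512))
ref = reference_embedding(recordings)
print(ref.shape, float(np.linalg.norm(ref)))
```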
      <p>
        In the next phase of the study, it was necessary to determine the optimal verification threshold.
To this end, a full authentication process was simulated to assess system behavior under real-world
conditions. The objective was to achieve a balanced trade-off between false negatives and false
positives. Within this simulation, pairs of recordings were formed to cover two scenarios:
verification of valid users, where test samples were compared with the reference embeddings of the
same individual; and verification of invalid users, where test embeddings were compared with the
reference vectors of other speakers [
        <xref ref-type="bibr" rid="ref24 ref25 ref26 ref27 ref28 ref29 ref30">24–30</xref>
        ].
      </p>
      <p>Once the reference database was created, the system proceeded to the verification stage. Using
the same neural model employed for reference construction, an extended set of voice samples from
each speaker was processed to extract test embeddings. Each test vector was then compared with
the corresponding reference embedding using cosine similarity. If the cosine distance value was
below a predefined threshold, verification was considered successful; otherwise, the sample was
not recognized as belonging to the claimed user (Figure 1).</p>
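      <p>The decision rule reduces to a cosine-distance comparison against a threshold, as sketched below. The embeddings are synthetic and the threshold of 0.35 is purely illustrative; in the study the threshold was calibrated empirically from valid and invalid verification pairs.</p>

```python
import numpy as np

def cosine_distance(a, b):
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(a @ b)

def verify(test_emb, ref_emb, threshold=0.35):
    # Accept only when the cosine distance falls below the threshold.
    return cosine_distance(test_emb, ref_emb) < threshold

rng = np.random.default_rng(2)
ref = rng.normal(size=256)
genuine = ref + 0.1 * rng.normal(size=256)   # same speaker, mild variation
impostor = rng.normal(size=256)              # a different speaker
print(verify(genuine, ref), verify(impostor, ref))  # prints: True False
```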
      <p>To assess the results of the experimental study, a set of commonly accepted metrics traditionally
used in biometric verification was applied. The most frequently considered indicators for
quantitative analysis of system performance are Accuracy, False Acceptance Rate (FAR), False
Rejection Rate (FRR), and the integral criterion Equal Error Rate (EER).</p>
      <p>Accuracy is a basic metric that reflects the overall proportion of correctly classified cases in the
binary speaker verification task. It is defined as the ratio of correct decisions (successful
authentications and correct rejections) to the total number of verification attempts. Despite its
intuitive clarity, this metric is sensitive to class imbalance and is therefore not considered decisive
in verification studies.</p>
      <p>False Acceptance Rate (FAR) characterizes the security of the system and represents the
proportion of cases in which an unauthorized user is incorrectly accepted as legitimate. Formally,
FAR is calculated as the ratio of false acceptances to the total number of access attempts made by
impostors. Minimizing FAR is critically important for improving system resilience to attacks and
ensuring reliability.</p>
      <p>False Rejection Rate (FRR), on the contrary, reflects system usability and measures the proportion
of cases in which a registered user is incorrectly rejected as unauthorized. FRR is calculated as the
ratio of false rejections to the total number of access attempts made by genuine users. A low FRR
ensures better user experience and higher system availability. Obviously, it is impossible to
minimize both FAR and FRR simultaneously. The optimal balance between them is achieved by
selecting an appropriate similarity threshold for embeddings, which determines the priority:
minimizing false acceptances or false rejections.</p>
      <p>
        Equal Error Rate (EER) serves as an integral indicator of biometric system quality, defined at the
point where FAR and FRR intersect. The lower the EER, the more effective the system is
considered. Due to its independence from a specific threshold value, this metric is widely used as a
universal criterion for comparing different biometric authentication algorithms [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ].
      </p>
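      <p>The relationship between FAR, FRR, and EER can be sketched numerically: sweep a threshold over the score range and take the operating point where the two error rates are closest. The two score distributions below are synthetic stand-ins for genuine and impostor similarity scores.</p>

```python
import numpy as np

def far_frr(genuine, impostor, thr):
    far = float(np.mean(impostor >= thr))  # impostors wrongly accepted
    frr = float(np.mean(genuine < thr))    # genuine users wrongly rejected
    return far, frr

def eer(genuine, impostor):
    # Sweep candidate thresholds; EER is where FAR and FRR are closest.
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best = min(thresholds,
               key=lambda t: abs(np.subtract(*far_frr(genuine, impostor, t))))
    return float(np.mean(far_frr(genuine, impostor, best)))

rng = np.random.default_rng(3)
genuine = rng.normal(0.8, 0.1, 1000)   # similarity scores, same speaker
impostor = rng.normal(0.2, 0.1, 1000)  # similarity scores, other speakers
e = eer(genuine, impostor)
print(round(e, 3))
```

      <p>Because well-separated score distributions were chosen here, the resulting EER is near zero; with real overlapping distributions the same sweep locates the FAR/FRR crossing point.</p>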
    </sec>
    <sec id="sec-6">
      <title>6. Research Results</title>
      <p>A series of experiments was conducted to evaluate the scalability of the voice biometric
authentication system at five levels: 10, 20, 35, 50, and 70 registered users. For each level,
comparative testing was performed on four neural models—Pyannote, WavLM base-sv, WavLM
base-plus-sv, and ECAPA—followed by analysis of accuracy, false acceptance rate (FAR), false
rejection rate (FRR), and the equal error rate (EER).</p>
      <p>Table 1 presents the EER results obtained from the scalability experiments for the selected
models.</p>
      <p>Regardless of scale, ECAPA-TDNN demonstrates the best performance: the EER remains within
approximately 1.2–3.0% and consistently achieves the lowest value at each stage. Pyannote ranks
second (about 2.7–5.4%), showing a moderate increase in error with the growing number of users
and a slight improvement at 70 speakers. Both versions of WavLM exhibit the highest EER values
(approximately 7–11%). A general trend is observed (Figure 2): as the number of speakers increases,
the EER rises; however, partial stabilization occurs at the largest dataset size.</p>
      <p>Based on the constructed graph of accuracy versus the number of users, it was established that
the ECAPA-TDNN model demonstrates the highest stability and accuracy across all configurations.
Its accuracy remains in the range of 95–98%, and even under maximum scaling (70 users) it
maintains a high level of performance. Pyannote also showed stable accuracy, only slightly behind
ECAPA-TDNN, with a mostly uniform degradation as the number of users increased.</p>
      <p>The WavLM base-sv and WavLM base-plus-sv models demonstrated somewhat lower accuracy.
A particularly noticeable decline was observed after exceeding the threshold of 35 users. This may
indicate that, for these models, the increasing number of speakers complicates the discrimination of
embeddings, leading to a higher number of system errors.</p>
      <p>However, an interesting phenomenon was observed in the final iteration (70 users): the
accuracy of Pyannote, WavLM base-sv, and WavLM base-plus-sv increased compared to the
previous stage, breaking the general trend of degradation. This can be attributed to the larger pool
of speakers, which allowed the models to establish better threshold values and reduce both false
acceptances and false rejections.</p>
      <p>Overall, the observed trend confirms the classical principle: as the number of users in the
system increases, the task of distinguishing between voice embeddings becomes more complex,
which can lead to higher error rates and reduced accuracy.</p>
      <p>A histogram was constructed based on three key metrics: False Acceptance Rate (FAR) (red),
False Rejection Rate (FRR) (blue), and Equal Error Rate (EER) (green). This visualization enabled a
deeper analysis of model behavior under scalability conditions. The results provide a
comprehensive assessment of the scalability and robustness of different architectures in the
speaker verification task.</p>
      <p>ECAPA-TDNN confirmed its high effectiveness: it maintained minimal EER values across all
scalability levels and demonstrated a balanced trade-off between FAR and FRR. This indicates
optimal threshold calibration and the model’s ability to generate discriminative embeddings that
remain robust as the user base grows. Such characteristics make ECAPA-TDNN the most suitable
choice for real-world deployment in systems with strict requirements for performance and
security.</p>
      <p>Pyannote also showed relatively low error rates; however, at higher scales (50–70 users), a
gradual increase in FRR was observed. This reflects a reduced ability of the model to correctly
recognize legitimate users as the database expands. Despite this, the overall error levels remain
acceptable, making Pyannote suitable for medium-scale applications.</p>
      <p>In contrast, WavLM base-sv revealed significant limitations: elevated FAR values indicate a
higher probability of false acceptances. Such behavior is critically dangerous in security-sensitive
scenarios, such as banking services or government registries, where even isolated false acceptances
can have severe consequences.</p>
      <p>The WavLM base-plus-sv variant slightly reduced EER compared to the base version but did not
demonstrate stability comparable to ECAPA or Pyannote. As the number of users increased, FRR
rose substantially, indicating the model’s declining ability to reliably recognize legitimate speakers.
This significantly reduces usability and can negatively affect user experience.</p>
      <p>The overall trend shows that scaling the user base is not a neutral factor across all models: the
most robust architectures (such as ECAPA) maintain stable or even improved EER values due to
better alignment of embeddings, while weaker models exhibit simultaneous growth of both FAR
and FRR.</p>
      <p>Thus, the results confirm that the scalability of biometric systems directly depends on the choice
of model architecture. Powerful solutions such as ECAPA provide high tolerance to user base
expansion, while other architectures require additional measures—embedding optimization,
adaptive threshold selection, or even retraining on representative data. This underscores the need
not only for careful model selection but also for the application of comprehensive methods to
maintain scalability in the development of real-world voice biometric systems.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>This study presented an experimental comparison of four neural models for voice authentication in
order to evaluate their scalability. It was established that ECAPA-TDNN demonstrates the highest
stability and the lowest EER values across all user levels, confirming its suitability for large-scale
real-time systems. The Pyannote model also maintains acceptable accuracy, although it is
characterized by a gradual increase in FRR as the database expands. In contrast, the WavLM
architectures revealed scalability limitations: the base-sv variant exhibited elevated FAR, while the
base-plus-sv version showed instability in FRR.</p>
      <p>The findings indicate that maintaining high system performance and reliability requires the use
of compact embeddings, indexed search, and model optimization methods such as quantization and
knowledge distillation. The practical significance of the results lies in their applicability for the
development of scalable biometric services in domains such as banking, call centers, and voice
assistants.</p>
      <p>Declaration on Generative AI
While preparing this work, the authors used the AI tools Grammarly Pro to correct text
grammar and Strike Plagiarism to check for possible plagiarism. After using these tools, the authors
reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Desplanques</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thienpondt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Demuynck</surname>
          </string-name>
          , ECAPA-TDNN:
          <article-title>Emphasized Channel Attention, Propagation and Aggregation in TDNN-based Speaker Verification</article-title>
          ,
          <source>in: Interspeech</source>
          <year>2020</year>
          ,
          <fpage>3830</fpage>
          -
          <lpage>3834</lpage>
          . doi:10.21437/Interspeech.2020-2650
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nagrani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>VoxCeleb: Large-Scale Speaker Verification in the wild</article-title>
          ,
          <source>Comput. Speech Lang</source>
          .,
          <volume>60</volume>
          (
          <year>2020</year>
          )
          <fpage>101027</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Okabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Koshinaka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shinoda</surname>
          </string-name>
          ,
          <article-title>Attentive Statistics Pooling for Deep Speaker Embedding</article-title>
          ,
          <source>in: Interspeech</source>
          <year>2018</year>
          ,
          <fpage>2252</fpage>
          -
          <lpage>2256</lpage>
          . doi:10.21437/Interspeech.2018-993
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Exploring wav2vec 2.0 on Speaker Verification and Language Identification</article-title>
          ,
          <source>in: Interspeech</source>
          <year>2021</year>
          ,
          <fpage>1509</fpage>
          -
          <lpage>1513</lpage>
          . doi:10.21437/Interspeech.2021-1280
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Lakhno</surname>
          </string-name>
          , et al.,
          <article-title>Management of Information Protection based on the Integrated Implementation of Decision Support Systems</article-title>
          ,
          <source>East.-Eur. J. Enterp. Technol.</source>
          ,
          <volume>5</volume>
          (
          <issue>9</issue>
          )(89) (
          <year>2017</year>
          )
          <fpage>36</fpage>
          -
          <lpage>41</lpage>
          . doi:
          <volume>10</volume>
          .15587/
          <fpage>1729</fpage>
          -
          <lpage>4061</lpage>
          .
          <year>2017</year>
          .111081
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] <string-name><given-names>O.</given-names> <surname>Vakhula</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Opirskyy</surname></string-name>, <string-name><given-names>O.</given-names> <surname>Mykhaylova</surname></string-name>, <article-title>Research on Security Challenges in Cloud Environments and Solutions based on the “Security-as-Code” Approach</article-title>, in: <source>Cybersecurity Providing in Information and Telecommunication Systems II</source>, vol. <volume>3550</volume>, <year>2023</year>, <fpage>55</fpage>-<lpage>69</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] <string-name><given-names>V.</given-names> <surname>Susukailo</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Opirsky</surname></string-name>, <string-name><given-names>O.</given-names> <surname>Yaremko</surname></string-name>, <article-title>Methodology of ISMS Establishment against Modern Cybersecurity Threats</article-title>, in: <source>Lect. Notes Electr. Eng.</source>, Springer, Cham, <year>2021</year>, <fpage>257</fpage>-<lpage>271</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-030-92435-5_15</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] <string-name><given-names>I.</given-names> <surname>Opirskyy</surname></string-name>, et al., <source>Modern Methods of Ensuring Information Protection in Cybersecurity Systems using Artificial Intelligence and Blockchain Technology</source>, <string-name><given-names>O.</given-names> <surname>Harasymchuk</surname></string-name> (Ed.), Technology Center PC, Kharkiv, <year>2025</year>. doi:<pub-id pub-id-type="doi">10.15587/978-617-8360-12-2</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] <string-name><given-names>X.</given-names> <surname>Wang</surname></string-name>, et al., <article-title>ASVspoof 2019: A Large-Scale Public Database of Spoofed Speech for Speaker Verification</article-title>, <source>Comput. Speech Lang.</source>, <volume>60</volume> (<year>2020</year>) <fpage>101027</fpage>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] <article-title>Implementing Biometrics for Large-Scale Applications: Overcoming 6 Challenges</article-title>, Biostatistics.io (<year>2023</year>). https://biostatistics.io/qa/implementing-biometrics-for-large-scaleapplications-overcoming-6-challenges
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] <string-name><given-names>K.</given-names> <surname>Ruda</surname></string-name>, <article-title>Study of the Scalability of Biometric Authentication Systems based on Voice Embeddings</article-title>, <source>Soc. Dev. Secur.</source>, <volume>15</volume>(<issue>1</issue>) (<year>2025</year>) <fpage>161</fpage>-<lpage>170</lpage>. doi:<pub-id pub-id-type="doi">10.33445/sds.2025.15.1.15</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] <string-name><given-names>V.</given-names> <surname>Brydinskyi</surname></string-name>, et al., <article-title>Comparison of Modern Deep Learning Models for Speaker Verification</article-title>, <source>Appl. Sci.</source>, <volume>14</volume>(<issue>4</issue>) (<year>2024</year>) <fpage>1329</fpage>. doi:<pub-id pub-id-type="doi">10.3390/app14041329</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] <string-name><given-names>J.</given-names> <surname>Thienpondt</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Demuynck</surname></string-name>, <article-title>ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings</article-title>, in: <source>IEEE Autom. Speech Recognit. Underst. Workshop (ASRU 2023)</source>, Taipei, Taiwan, <year>2023</year>, <fpage>1</fpage>-<lpage>8</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ASRU57964.2023.10389750</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] <string-name><given-names>F.</given-names> <surname>Deng</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Jiang</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Deng</surname></string-name>, <article-title>Dense-Fusion2Net: A More Efficient and Lightweight Short Speech Speaker Recognition System with Time-Frequency Channel Attention</article-title>, <source>Sci. Rep.</source>, <volume>15</volume> (<year>2025</year>) <fpage>9601</fpage>. doi:<pub-id pub-id-type="doi">10.1038/s41598-025-93873-x</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] <string-name><given-names>R.</given-names> <surname>Sharma</surname></string-name>, et al., <article-title>Milestones in Speaker Recognition</article-title>, <source>Artif. Intell. Rev.</source>, <volume>57</volume> (<year>2024</year>) <fpage>58</fpage>. doi:<pub-id pub-id-type="doi">10.1007/s10462-023-10688-w</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] <string-name><given-names>G.</given-names> <surname>Chen</surname></string-name>, et al., <article-title>Towards Understanding and Mitigating Audio Adversarial Examples for Speaker Recognition</article-title>, <source>IEEE Trans. Dependable Secure Comput.</source>, <volume>20</volume>(<issue>5</issue>) (<year>2023</year>) <fpage>3970</fpage>-<lpage>3987</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TDSC.2022.3220673</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] <string-name><given-names>Z.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Xu</surname></string-name>, <article-title>Learning Domain-Heterogeneous Speaker Recognition Systems with Personalized Continual Federated Learning</article-title>, <source>EURASIP J. Audio Speech Music Process.</source>, <volume>33</volume> (<year>2023</year>). doi:<pub-id pub-id-type="doi">10.1186/s13636-023-00299-2</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] RudderAnalytics, <article-title>Building a Robust Speaker Verification System for Secure Voice Authentication</article-title>, <source>Medium</source> (<year>2023</year>). https://medium.com/@rudderanalytics/voice-based-securityimplementing-a-robust-speaker-verification-system-12c5fd98f1c1
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19] <string-name><given-names>M.</given-names> <surname>Sharif-Noughabi</surname></string-name>, <string-name><given-names>S. M.</given-names> <surname>Razavi</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Mohamadzadeh</surname></string-name>, <article-title>Improving the Performance of Speaker Recognition System using Optimized VGG Convolutional Neural Network and Data Augmentation</article-title>, <source>Int. J. Eng.</source>, <volume>38</volume>(<issue>10</issue>) (<year>2025</year>) <fpage>2414</fpage>-<lpage>2425</lpage>. doi:<pub-id pub-id-type="doi">10.5829/ije.2025.38.10a.17</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] <article-title>On-Device Speech Processing Makes Alexa Faster, Lower Bandwidth</article-title>, <source>Amazon Sci. Blog</source> (<year>2023</year>). https://www.amazon.science/blog/on-device-speech-processing-makes-alexa-faster-lowerbandwidth
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21] <article-title>An Overview of Speech Recognition Techniques</article-title>, <source>Google Res.</source> (<year>2023</year>). https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42535.pdf
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22] <string-name><given-names>P.</given-names> <surname>Petriv</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Opirskyy</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Mazur</surname></string-name>, <article-title>Modern Technologies of Decentralized Databases, Authentication, and Authorization Methods</article-title>, in: <source>Cybersecurity Providing in Information and Telecommunication Systems II</source>, vol. <volume>3826</volume>, <year>2024</year>, <fpage>60</fpage>-<lpage>71</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23] <article-title>ua-polit-tiny</article-title>, Hugging Face - the AI Community Building the Future (<year>2023</year>). https://huggingface.co/datasets/vbrydik/ua-polit-tiny
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24] <string-name><given-names>I.</given-names> <surname>Iosifov</surname></string-name>, <string-name><given-names>O.</given-names> <surname>Iosifova</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Sokolov</surname></string-name>, <article-title>Sentence Segmentation from Unformatted Text using Language Modeling and Sequence Labeling Approaches</article-title>, in: <source>7th International Scientific and Practical Conference Problems of Infocommunications. Science and Technology</source> (<year>2020</year>) <fpage>335</fpage>-<lpage>337</lpage>. doi:<pub-id pub-id-type="doi">10.1109/PICST51311.2020.9468084</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25] <string-name><given-names>O.</given-names> <surname>Iosifova</surname></string-name>, et al., <article-title>Analysis of Automatic Speech Recognition Methods</article-title>, in: <source>Cybersecurity Providing in Information and Telecommunication Systems</source>, vol. <volume>2923</volume> (<year>2021</year>) <fpage>252</fpage>-<lpage>257</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26] <string-name><given-names>I.</given-names> <surname>Iosifov</surname></string-name>, et al., <article-title>Transferability Evaluation of Speech Emotion Recognition between Different Languages</article-title>, <source>Advances in Computer Science for Engineering and Education</source>, <volume>134</volume> (<year>2022</year>) <fpage>413</fpage>-<lpage>426</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-031-04812-8_35</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27] <string-name><given-names>O.</given-names> <surname>Romanovskyi</surname></string-name>, et al., <article-title>Prototyping Methodology of End-to-End Speech Analytics Software</article-title>, in: <source>4th Int. Workshop on Modern Machine Learning Technologies and Data Science</source>, vol. <volume>3312</volume> (<year>2022</year>) <fpage>76</fpage>-<lpage>86</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28] <string-name><given-names>O.</given-names> <surname>Romanovskyi</surname></string-name>, et al., <article-title>Automated Pipeline for Training Dataset Creation from Unlabeled Audios for Automatic Speech Recognition</article-title>, <source>Advances in Computer Science for Engineering and Education IV</source>, vol. <volume>83</volume> (<year>2021</year>) <fpage>25</fpage>-<lpage>36</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-030-80472-5_3</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29] <string-name><given-names>I.</given-names> <surname>Iosifov</surname></string-name>, et al., <article-title>Natural Language Technology to Ensure the Safety of Speech Information</article-title>, in: <source>Cybersecurity Providing in Information and Telecommunication Systems</source>, vol. <volume>3187</volume>, no. <issue>1</issue> (<year>2022</year>) <fpage>216</fpage>-<lpage>226</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30] <string-name><given-names>O.</given-names> <surname>Romanovskyi</surname></string-name>, et al., <article-title>Accuracy Improvement of Spoken Language Identification System for Close-Related Languages</article-title>, <source>Advances in Computer Science for Engineering and Education VII</source>, vol. <volume>242</volume> (<year>2025</year>) <fpage>35</fpage>-<lpage>52</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-031-84228-3_4</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31] <article-title>Defining the Core Accuracy Metrics of Biometric Systems</article-title>, Alice Biometrics (<year>2023</year>). https://alicebiometrics.com/en/defining-the-core-accuracy-metrics-of-biometric-systems
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>