Integrated System for Speaker Diarization and Intruder Detection using Speaker Embeddings

Illia Zaiets, Vitalii Brydinskyi, Dmytro Sabodashko, Yurii Khoma, and Khrystyna Ruda
Lviv Polytechnic National University, Bandery 12, Lviv, 79013, Ukraine

Abstract
This paper explores the use of diarization systems, which employ advanced machine learning algorithms for the precise detection and separation of different speakers in audio recordings, for the implementation of an intruder detection system. Several state-of-the-art diarization models, including NVIDIA NeMo, Pyannote, and SpeechBrain, are compared. The performance of these models is evaluated using the metrics typically applied to diarization systems: Diarization Error Rate (DER) and Jaccard Error Rate (JER). The diarization systems were tested under various audio conditions, including noisy and clean environments and small and large numbers of speakers. The findings reveal that Pyannote delivers superior diarization accuracy and was therefore used to implement the intruder detection system. This system was further evaluated on a custom dataset based on Ukrainian podcasts and achieved 100% recall and 93.75% precision: it did not miss a single criminal in the dataset, but could occasionally flag a non-criminal as a criminal. The system proves effective and flexible for intruder detection in audio files of different sizes and with different numbers of speakers.

Keywords
Deep learning, diarization, speaker embeddings, cybersecurity, intruder detection

1. Introduction

Nowadays digital technologies are changing the world around us at an incredible speed, and we are faced with a huge amount of information to process every day. This poses a challenge for many industries, especially cybersecurity and big audio data processing, where accurate and, most importantly, timely data analysis becomes a key success factor. This paper dives into this topic by proposing the development of a speech diarization system based on state-of-the-art machine learning libraries to effectively detect intruders by their voices [1–5].

To ensure the effectiveness and accuracy of the developed diarization system, the VoxConverse dataset was used. This dataset contains a wide range of audio recordings, from single speeches to complex discussions with overlapping voices, allowing the system to be tested in a variety of conditions, and is an ideal testing environment.

Particular attention was paid to how the systems handled the most common challenges in modern audio, such as noise, overlapping voices, and varying speaker volumes. It was these challenging recordings that helped us select the best library for the system.

We have developed a methodology for testing and analyzing data to compare diarization libraries, allowing us not only to assess the accuracy of recognition but also to understand the strengths and challenges of each system and how best to use them.

To evaluate the diarization libraries, we used metrics that help to objectively assess the accuracy and reliability of each library: DER and JER.

CPITS-2024: Cybersecurity Providing in Information and Telecommunication Systems, February 28, 2024, Kyiv, Ukraine
EMAIL: illia.zaiets.mkbas.2022@lpnu.ua (I. Zaiets); vitalii.a.brydinskyi@lpnu.ua (V. Brydinskyi); dmytro.v.sabodashko@lpnu.ua (D. Sabodashko); yurii.v.khoma@lpnu.ua (Y. Khoma); khrystyna.s.ruda@lpnu.ua (K. Ruda)
ORCID: 0009-0007-0754-0463 (I. Zaiets); 0000-0001-8583-9785 (V. Brydinskyi); 0000-0003-1675-0976 (D. Sabodashko); 0000-0002-4677-5392 (Y. Khoma); 0000-0001-8644-411X (K. Ruda)
© 2024 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

The results of the evaluation of the diarization libraries were used to select the best one that would be suitable for building a system capable of accurately and efficiently identifying and separating speakers in audio recordings to detect intruders. Ultimately, the Pyannote library was selected as a key element of the system, which forms part of a comprehensive multi-level information security framework [6, 7].

2. Materials and Methods

Audio diarization technology detects and distinguishes individual speakers in audio recordings. This technology is based on the complex analysis of voice data, using machine learning and deep learning algorithms to recognize the voice characteristics of each speaker and to identify the individuals involved in a conversation. This includes analyzing tone of voice, speaking speed, accents, and other unique features that distinguish one speaker from another. Its main task is to divide the audio stream into separate segments so that each segment represents the moment when one person speaks or when there is a change of speakers.

This process seeks to answer the question: "Who speaks when?" throughout the audio recording [8]. Thanks to this, the analysis of audio materials becomes much easier, especially in situations where there are many participants in a conversation and their voices often overlap or alternate. Thus, audio diarization is becoming an indispensable tool for understanding and analyzing complex audio recordings, particularly in the context of cybersecurity and other areas where speaker identification accuracy is critical [9]. Examples of how diarization is used:

• Identifying different speakers in an audio file for cybersecurity investigations, where it is important to understand who exactly participated in the conversation.
• Analyzing communications, such as intercepted phone calls or meeting notes, to help identify suspicious or malicious activity.
• Detecting fraud attempts in telephone calls, for example, by identifying inconsistencies in voices or attempts at manipulation.
• Automating the process of distributing and analyzing audio files, simplifying the work of analysts.
• Protecting the confidentiality of information by "monitoring" audio communications in large organizations to ensure that confidential information is not disclosed.

Overall, audio diarization plays an important role in cybersecurity, helping to detect and prevent fraud, crime, and other cyber threats. Such an approach plays an important role in protecting information and identifying potential threats, which is especially relevant in the context of the growing number of cyberattacks and fraudulent activities.

However, the implementation of audio diarization in the context of cybersecurity faces several challenges. One of the main ones is the presence of background noise in audio recordings, which can significantly complicate the process of speaker recognition. To solve this problem, various methods of filtering and cleaning the audio signal are used. Another important aspect is the variability of speech features, such as accents, intonations, and speech speed. This requires audio diarization systems to be highly flexible and able to adapt to a variety of conditions.

Audio diarization involves several critical steps to accurately recognize and separate speakers in audio recordings. These stages include Voice Activity Detection, Overlapped Speech Detection, Speaker Change Detection, Segmentation, Speaker Embedding Extraction, Clustering, and the Neural Diarizer. The diarization pipeline is shown in Fig. 1.

Figure 1: Diarization structure containing its main steps
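The "who speaks when" output described above is usually represented as a list of speaker-labeled time segments. The following minimal, library-agnostic sketch (the timestamps and speaker labels are invented for illustration) shows how such segments can be queried for per-speaker speaking time and speaker-change moments:

```python
# Hypothetical diarization output: (start, end, speaker) tuples in seconds,
# answering the question "who speaks when?".
segments = [
    (0.0, 4.2, "SPEAKER_00"),
    (4.2, 9.7, "SPEAKER_01"),
    (9.7, 12.1, "SPEAKER_00"),
]

def speaking_time(segments):
    """Total speaking time per speaker, in seconds."""
    totals = {}
    for start, end, speaker in segments:
        totals[speaker] = totals.get(speaker, 0.0) + (end - start)
    return totals

def change_points(segments):
    """Moments where the active speaker changes."""
    return [cur[0] for prev, cur in zip(segments, segments[1:]) if prev[2] != cur[2]]

print(speaking_time(segments))   # per-speaker totals in seconds
print(change_points(segments))   # [4.2, 9.7]
```

Real diarization libraries return richer objects, but they reduce to this structure, which is also what evaluation metrics such as DER and JER are computed from.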
229 Automatic Speech Recognition—Fig. 1 and effectively cluster audio segments demonstrates that the ASR process can be used according to speakers. in parallel with diarization, if necessary [10]. Thanks to neural networks, the diarization Voice Activity Detection—the first stage is process becomes more accurate and flexible. voice activity detection, where the system Neural diarizers can efficiently process a large determines whether a voice signal is present in a amount of audio data and are also better able certain audio segment. This enables users to to cope with challenges such as overlapping filter out quiet areas or noise, focusing solely on voices or changing recording conditions [16]. segments with voice activity such as speech [11]. The key aspect of audio diarization is the Speaker Change Detection & Overlapped use of specific machine learning algorithms. Speech Detection—the process includes Neural networks, for example, work effectively speaker change detection and detection of with the task of classifying audio segments by moments when several speakers are speaking speakers. Clustering helps in grouping similar at the same time. The system analyzes the speech features, which makes it easier to audio stream to detect the moments when one identify individual speakers. Speaker speaker finishes speaking and another starts. recognition is important for determining who This helps to divide the audio into segments, is speaking at a particular moment in a each of which reflects the speech of a particular recording. speaker. At this stage, the diarization process Data processing is also an integral part of faces several challenges. First, the quality of the process. Extracting features from audio, the audio recording can vary considerably, and such as tone, speech rate, and other noise, echo, and other audio interference can characteristics, is critical to the accuracy of the make it difficult to identify speakers. Secondly, algorithms. 
Data normalization ensures the taking into account a variety of speech homogeneity of the input data, which helps in features, including accents, dialects, and improving the accuracy of machine learning intonations, is an important aspect of ensuring models. accurate diarization. Third, voice overlap, When it comes to optimizing and tuning when multiple people speak at the same time, models, it is important to choose the right presents a challenge for accurate speaker hyperparameters to maximize the efficiency of segmentation and identification [12]. the diarization. Optimization strategies include Segmentation then segmentation takes choosing the model architecture, adjusting the place, where the audio recording is divided learning rate, and other parameters that can into smaller parts for detailed analysis. Each affect the result. segment is checked for unique features of the Last but not least, the results are analyzed speaker [13]. and validated. This includes evaluating the Speaker Embedding Extraction is one of accuracy of the algorithms on different datasets the most important stages in the extraction of and analyzing errors, which allows for further speaker characteristics. The system identifies improvement of the diarization methods. unique voice attributes, such as timbre, tempo, The tools used, including PyAnnote [17], and intonation, which allows the creation of a NVIDIA NeMo [18], and SpeechBrain [19], have unique “fingerprint” of each speaker [14]. a variety of functionalities for complex speech Clustering at this stage, clustering takes analysis, speaker identification, and speaker place, where segments with similar diarization. characteristics are organized together. This These are three of the main and most allows the system to recognize and group popular Python libraries used in diarization segments belonging to the same speaker [15]. tasks. 
PyAnnote is great for automatic audio Neural Diarizer is the final stage that annotation and speaker identification. NVIDIA involves the use of neural diarizers. Neural NeMo offers powerful tools for working with diarizers are deep learning-based systems that neural networks, which is ideal for complex can automatically identify different speakers in diarization tasks. SpeechBrain, with its complex audio recordings. They use powerful flexibility and open-source nature, is another neural networks to analyze audio signals, pick great tool for speech processing and diarization. up subtle differences between different voices, In general, the use of machine learning algorithms in audio diarization tasks is an 230 important step towards the development of In general, Pyannote is an impressive audio speech processing technologies, offering more diarization solution that continues to evolve accurate and efficient solutions. and find new applications in a variety of areas, for both research and commercial use. This 3. Aim of Research library is especially valuable for its ability to perform complex audio processing tasks, providing reliable and accurate results. The main goal of this article is to implement a system that can accurately recognize and separate the voices of speakers in audio b. Nvidia NeMo recordings and detect intruders. This is of great importance not only for information NVIDIA NeMo, which stands for Neural security, but also for other areas where it is Modules, is an innovative approach to machine important to analyze speech accurately. To learning and audio analysis. This library, achieve this goal, we analyzed the capabilities created by NVIDIA, specializes in applying of libraries such as Pyannote, NVIDIA NeMo, deep learning to a variety of speech-processing and SpeechBrain, and built a diarization tasks, including audio diarization. system capable of intruder detection based on NeMo’s special feature is its modular this analysis. 
architecture, which allows researchers and developers to easily create, customize, and optimize different components of neural 4. Models Overview networks for specific tasks. This makes NeMo a. PyAnnote not only a powerful tool for machine learning experts but also an accessible solution for a Pyannote represents an important direction in wider range of users who may not have deep the development of audio diarization knowledge in this area. algorithms. It is an open-source tool developed In the context of audio diarization, NeMo for audio data processing, especially focused uses advanced deep learning algorithms to on diarization tasks. efficiently recognize and separate speakers in The main advantage of Pyannote is its audio recordings. With its high accuracy and flexibility and high accuracy, provided by the ability to process complex audio data, NeMo is use of deep learning algorithms. It uses neural becoming an important tool in tasks that networks to analyze audio recordings, detect require recognizing different voices, even in speaker identification features or embeddings, the presence of noise or overlapping voices. and separate and classify them. This allows NVIDIA NeMo is also continuously updated Pyannote to efficiently separate audio to include the latest advances in machine recordings into segments, each corresponding learning and speech processing. This ensures to a specific speaker, even in difficult that users have access to the most advanced conditions where voices overlap or there is technologies to solve their problems. background noise. 
In summary, NVIDIA NeMo plays a In addition to diarization, Pyannote also significant role in today’s audio diarization provides tools for other audio processing process by offering flexible, scalable, and high- tasks, such as voice activity detection, and performance solutions for a variety of research gender and age recognition, making it a and commercial applications in different fields, multifunctional solution. including cybersecurity. Another important aspect of Pyannote is its community and open nature. Developers and c. SpeechBrain researchers can contribute their improvements and adaptations, which SpeechBrain is another important player in the contributes to the constant updating and field of machine learning algorithms for audio improvement of the tool. This also means that diarization. This open-source tool was developed the library is constantly adapting to new as a one-stop solution for a variety of speech- challenges and technological breakthroughs in processing tasks, including audio diarization, audio data processing. speech recognition, and speech synthesis. 231 SpeechBrain is flexible, allowing users to easily duration of speech falsely detected as non- customize and adapt the system to their specific speech in an audio file, and Tc is the duration of needs. The use of deep learning algorithms speaker confusion in an audio file. allows SpeechBrain to efficiently process Jaccard Error Rate (JER) is an error that complex audio recordings and accurately determines how often the speakers are falsely separate the speech of different speakers by detected as other speakers. This metric is returning their embeddings. based on the Jaccard index, which measures One of the advantages of SpeechBrain is its the similarity between the sets. ability to handle large amounts of data, making it Jaccard error rate can be calculated using an ideal solution for processing audio recordings the following equation: on the scale required today. 
Also important is its ∑𝑁𝑖=0|𝑆𝑖 ∩ 𝐷𝑖 | (2) ability to adapt to different recording conditions, 𝐽𝐸𝑅 = 1 − 𝑁 , ∑𝑖=0|𝑆𝑖 ∪ 𝐷𝑖 | including different languages, accents, and sound where Si is the set of the segments for a quality. speaker i in the test dataset, Di is the set of the SpeechBrain is also characterized by its segments where speaker i was predicted, and openness, which facilitates a community of N is the total number of speakers. researchers and developers to work together to improve and adapt the tool. This creates a b. Initial Analysis dynamic environment for innovation and development in the field of speech processing. Overall, SpeechBrain offers a feature-rich Before starting extensive testing on massive data solution for audio diarization tasks, providing from various Python libraries, we first conducted high accuracy, flexibility, and scalability, which is experiments with the Pyannote algorithm using important for a wide range of applications, from audio recordings with a wide range of conditions. scientific research to commercial audio This included files with different numbers of processing projects. participants, variations in noise levels, and different degrees of speech overlap. This way, we can better understand Pyannote’s performance 5. Experiment Setup and reliability in different acoustic scenarios, which is critical for further analysis of larger A high-performance NVIDIA RTX 3090 data. graphics card was chosen to effectively solve Initially, a two-minute audio file was selected the tasks of audio diarization. This choice was for analysis, which was a recording of a news made due to its high computing power and broadcast. The peculiarity of this recording was optimization for deep learning tasks, which is the high level of background noise, although critical for the efficient processing and analysis there were no overlapping audio tracks. This of large amounts of audio data. 
choice allowed us to evaluate how efficiently the algorithm can process audio with complex sound a. Metrics environment conditions, not complicated by the simultaneous speech of several speakers. Diarization Error Rate (DER)—error of The audio file with two speakers and detecting the segment’s boundaries and overlaps background noise is visualized in Fig. 2. in the audio recording considering the true or false assignment of speaker identifier to the audio recording segment. This error is the main for diarization and is to be the generally accepted metric in commercial systems. Diarization error rate can be calculated using the following equation: 𝑇𝑎 + 𝑇𝑚 + 𝑇𝑐 (1) Figure 2: Timeline of an audio file containing 𝐷𝐸𝑅 = , 𝑇 two speakers with background noise where T is the total duration of an audio file, Ta is the duration of non-speech falsely The next step in the study was a more complex detected as speech in an audio file, Tm is the task for the Pyannote library. A 16-minute 232 audio recording of a conference with 11 people Table 1 present at the same time was chosen for Initial analysis results analysis. This recording was characterized by a Conditions DER JER significant level of noise, changes in speech Noisy environment; 2 speakers 0.19 0.19 Noisy environment; 2 speakers 0.76 0.75 volume, and frequent interruptions between Clean environment; 2 speakers 0.07 0.07 speakers. This made it possible to evaluate Pyannote’s ability to effectively cope with the high level of complexity in speech recognition From the experiment results it can be seen and speaker identification in multi-voice audio. that the diarization system performed well on The audio file with eleven speakers with the smaller amount of speakers, though a bit background noise is visualized in Fig. 3. worse when the noisy environment was present. 
When it comes to diarization on the bigger number of speakers in the noisy environment, the system did not perform well, so it is not recommended to use this system with data, where there are a lot of speakers with potential overlaps and a noisy environment on top. c. Test Dataset for Model Selection The VoxConverse dataset [20] was chosen to Figure 3: Timeline of an audio file containing evaluate popular Python diarization libraries. eleven speakers VoxConverse is an extensive dataset that was created for speech diarization tasks and is a The study was continued by selecting an audio good resource for researchers and developers recording of a clean speech with no in this field. This dataset includes a large background noise, overlapping tracks, or number of audio recordings that cover a wide interruptions in speech. This recording is ~15 range of scenarios from public speeches and minutes long and represents ideal conditions interviews to newscasts and debates. A special for analysis, which makes it possible to feature of VoxConverse is the presence of evaluate the algorithm’s performance under recordings where speech overlap is observed, optimal conditions. This choice allows you to which is very typical in real-world settings and establish a baseline level of accuracy of the is of great interest for research. diarization system under ideal conditions, The audio recordings in VoxConverse are without external interference. annotated with detailed labels that include The audio file with two speakers with a time intervals and speaker identifiers. This clean background is shown in Fig. 4 with information is extremely valuable as it allows background noise. us to accurately assess how different diarization algorithms and systems perform in detecting and distributing speech among different speakers. Such annotations are important for comparing the results of diarization systems with the “ideal” and evaluating their effectiveness. 
The large amount of data in VoxConverse enables deep and comprehensive analysis. Figure 4: Timeline of an audio file containing This allows researchers to evaluate dialysis two speakers with clear audio systems in a variety of settings, including Table 1 contains the results of the initial study scenarios with a variable number of speakers, of the robustness of the diarization library different noise levels, and different speech Pyannote. styles and accents. This diversity helps to improve the reliability and accuracy of dialysis 233 systems and contributes to the development of evaluate which algorithm is the most efficient more versatile and adaptive solutions. in this parameter. This analysis will not only Thanks to its openness and accessibility, help identify the most accurate voice VoxConverse has become a valuable tool for recognition system but also determine which the community to conduct collaborative one provides the best ratio of speed and quality research and development in the field of of data processing. speech diarization. The use of such datasets The model selection experiment results are helps researchers identify new challenges that presented in Table 2. modern systems may face and develop more Table 2 efficient algorithms for speech processing. Model selection experiment results Out of the entire VoxConverse dataset of Model Elapsed time DER JER 464 records, the first 50 records of the dataset SpeechBrain 3m 53s 0.31 0.31 were selected for the tests to reduce the time NVIDIA NeMo 17m 32s 0.41 0.41 to perform the diarization and reduce resource Pyannote 20m 7s 0.14 0.14 usage. These selected records have different lengths, ranging from 3 to 20 minutes, which SpeechBrain, an open-source machine provides a wide range of conditions to evaluate learning library, impressed with its processing the performance of my chosen Python speed, taking only 3 minutes and 53 seconds, machine-learning libraries. 
Not only does this although accuracy leaves much to be desired approach allow for a focus on detail, but it also and additional tuning is required to achieve provides practical relevance by demonstrating optimal results. The average DER of 31% can how systems adapt to variability in real-world be considered satisfactory, given the openness speech scenarios. This helps to gain a deeper of the library and the complexity of the data. understanding of each system's performance NVIDIA’s NeMo took longer to process—17 in situations that may occur in real life and minutes and 32 seconds—but showed good identify potential areas for further diarization results, especially given the improvement. complexity of the audio data. The average DER of 14% indicates the efficiency of the d. Model Selection for Intruder algorithm. Detection System For the intruder detection system, it was decided to use Pyannote, which showed the For each of the selected libraries, we developed best results in diarization of this dataset. With the appropriate code, taking into account their an average DER of 9%, Pyannote effectively unique features. The goal was to ensure that handles the challenges of the dataset. Despite the final result in each of them complied with the fact that Pyannote’s processing time was the generally accepted RTTM standard for 20 minutes and 7 seconds, this is compensated diarization timestamps. This methodology by its high accuracy and high-quality provided the ability to equally evaluate and documentation, which allows users to quickly compare the results obtained using different get started with the library. Although the diarization systems. processing time is not a decisive factor As part of the experiment, we used models compared to NeMo, the time to implement and of these libraries trained on the VoxConverse configure Pyannote was significantly shorter. dataset to evaluate their effectiveness in real- Thus, given the speed, accuracy, and ease of world conditions. 
The main goal of this use, Pyannote was the choice for the final task, experiment is to determine which of these demonstrating an excellent balance between libraries is best suited for the final task of processing time and quality of results. detecting an intruder. The following metrics were chosen to evaluate the performance of each library: average DER and JER. These metrics were calculated based on 50 selected test recordings from VoxConverse. We also took into account the diarization time for each system to 234 e. Diarization for Intruder Detection Intruder Detection System Task Implementation Data Preparation Moving forward, the subsequent phase in our research involves the creation of a bespoke Before developing and analyzing an intruder algorithm. This algorithm is tailored detection system, it is necessary to collect and specifically to extract embeddings from audio prepare data that will be used to train and test recordings that contain the vocal patterns of the model. This stage involves selecting individuals labeled as “intruders”. The core appropriate audio recordings and processing process of this development entails the them to ensure effective training of the system. transformation of the distinctive vocal After creating a reliable and representative characteristics of each speaker into complex, training sample, the next step is to develop a high-dimensional numerical vectors. These method for detecting and identifying potential vectors are a crucial element as they criminals in the database. The use of various encapsulate the unique voice features in a methods of speech diarization will allow us to quantifiable form. test the effectiveness of the system and ensure The strategic utilization of these voice its practical use in real-life scenarios. embeddings plays a vital role in our study. It For the study, several episodes of a well- enables a more refined and in-depth known Ukrainian YouTube podcast were comparison and analytical process. 
This is selected, where two hosts are constantly achieved by measuring the cosine distances participating and different guests come to each between these numerical vectors. By analyzing episode. In the experiment, some guests were these distances, we can ascertain with a high conditionally labeled as “intruders”. Five degree of precision whether a particular separate three-minute audio recordings were segment of speech can be attributed to a created for each guest to extract their voice specific “criminal” or another speaker. This embeddings. The total size of the dataset for methodology is highly effective in identifying “intruders” is 56 recordings, each of distinguishing between different voices in an which is two to three minutes long. Out of this audio recording. This approach is key to the number, 15 recordings include the voices of development of systems used in forensic the identified “intruders”, while the remaining research and other areas where it is necessary 41 do not. This provides a unique opportunity to accurately identify a person by voice. to evaluate how well the developed diarization The next step in the research is to apply an system performs in recognizing and separating algorithm to collect embeddings from all voices in real-life situations, which is key for suspect recordings in the database. This application in practical scenarios. process involves analyzing each audio file and The dataset used for this experiment can be extracting the corresponding embeddings. found here [21]. Once the embeddings are collected, they are The example of a prepared audio file for clustered. This procedure allows you to group intruder detection is shown in Fig. 5. similar voice characteristics, which is key to simplifying the subsequent identification process. Clustering reduces the need to make multiple comparisons between each segment’s echo and all of the offender’s echoes, thereby increasing the efficiency and accuracy of identification. 
In addition, clustering helps to identify common characteristics of the voices of the “intruders”, which can help to accurately identify potential suspects.

Figure 5: Example of an audio file containing an intruder’s voice

The processing of podcasts includes downloading each audio file, running diarization on it, and then selecting only the segments longer than 5 seconds. This is done to ensure detailed analysis and accurate identification of the different voices in the recording. Each segment is then checked against the intruder speech patterns that are predefined and stored in a database. This technique makes it possible to accurately pinpoint the moments of the suspect’s presence in the audio material and also to identify intruders who speak only in certain parts of the podcast, which is important for ensuring high identification accuracy without affecting processing speed.

Before comparing segments, a key threshold parameter must be specified, as it can dramatically change the results of the study. The threshold plays a crucial role in determining whether a voice in a podcast segment matches the voice of a known intruder. It serves as a measure of the level of similarity between voice embeddings. The key point is that if the cosine distance between the segment’s embedding and the nearest intruder’s embedding is less than this threshold, the system recognizes the presence of an intruder in that segment.

The intruder detection system is shown in Fig. 6.

Figure 6: Intruder detection system

Intruder Detection System Experiment

After the development of the main components of the system is completed, the next stage is its launch and testing on the selected dataset. This allows us to evaluate the functionality and efficiency of the developed system in real conditions. An important part of this process is the analysis of the results, which will help identify the strengths and weaknesses of the program, as well as possible areas for further improvement. Testing on a dataset will not only confirm the program’s ability to effectively recognize intruder voices but will also provide valuable insight into its overall accuracy and reliability in various use cases.

During the meticulous analysis of the dataset, the algorithm demonstrated a remarkable level of accuracy in identification tasks. Among the entirety of the audio files that were processed, only a single file was erroneously classified as containing the voice of a criminal. This outcome may indicate limitations inherent in the algorithm itself, or it could point to specific characteristics of the audio file that influenced its recognition. Significantly, all other files within the dataset were identified with a high degree of accuracy, a fact that robustly affirms the effectiveness and reliability of the system we have developed.

Moreover, this solitary instance of misidentification, while an outlier, is of considerable value. It offers critical insights and catalyzes further in-depth analysis and fine-tuning of the algorithm. By closely examining this case, we can gain a deeper understanding of the algorithm’s current capabilities and limitations. This understanding is instrumental in guiding subsequent enhancements and optimizations. Our goal is to refine the algorithm’s precision in distinguishing between the presence of intruders and non-intruders across a diverse range of audio scenarios. This ongoing process of improvement is pivotal to ensuring that the system remains highly efficient and effective in various real-world applications.

Figure 7: Intruder detection experiment confusion matrix

Drawing upon the data derived from the confusion matrix, as illustrated in Figure 7, we can compute several crucial algorithm performance metrics.
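As an illustrative sketch of how the per-file decisions behind such a confusion matrix could be produced, the snippet below combines the two rules described earlier: discard segments shorter than the minimum duration, and flag a segment whose cosine distance to the nearest intruder embedding falls below the threshold. All function names, toy embeddings, and default values are hypothetical, not taken from the published system:

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) *
                        math.sqrt(sum(x * x for x in b)))

def file_contains_intruder(segments, intruder_embs,
                           min_segment_duration=5.0,
                           similarity_threshold=0.5):
    """A file is flagged if any sufficiently long segment lies within
    `similarity_threshold` cosine distance of an intruder embedding.
    `segments` is a list of (duration_seconds, embedding) pairs."""
    for duration, emb in segments:
        if duration < min_segment_duration:
            continue  # short segments give unreliable embeddings
        if min(cosine_distance(emb, ref) for ref in intruder_embs) < similarity_threshold:
            return True
    return False

def confusion_counts(decisions, labels):
    """Tally per-file (predicted, actual) pairs into TP/FP/FN/TN counts."""
    tp = sum(p and a for p, a in zip(decisions, labels))
    fp = sum(p and not a for p, a in zip(decisions, labels))
    fn = sum((not p) and a for p, a in zip(decisions, labels))
    tn = sum((not p) and (not a) for p, a in zip(decisions, labels))
    return tp, fp, fn, tn

# One long matching segment flags the file; a short match alone does not.
flagged = file_contains_intruder([(6.0, [0.95, 0.05])], [[1.0, 0.0]])
```

Comparing these per-file decisions against ground-truth labels yields the four confusion-matrix cells from which the metrics below are computed.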
Notable among these are Accuracy, Precision, Recall, and the F1-Score. These metrics are indispensable, as they furnish insightful details regarding the model’s proficiency in precisely detecting intruders. Furthermore, they shed light on the model’s dependability in minimizing instances of false positives and false negatives. The attainment of high values in these metrics is a clear indicator that the system we have developed is highly competent in its designated functions: it identifies intruders with remarkable accuracy and is characterized by a minimal occurrence of errors. This aspect of the system’s performance is not only a testament to its effectiveness but also highlights its reliability in critical situations where the accurate detection of intruders is paramount.

Table 3
Intruder detection experiment results

Accuracy, %   Recall, %   Precision, %   F1-score, %
98.21         100.00      93.75          96.77

We evaluated the resilience and effectiveness of the compared diarization systems across a spectrum of environmental conditions. Central to our study is the development of an intruder detection system that is fundamentally based on the principles and technology of speaker diarization. The results of our investigation reveal a notable suitability of diarization models for intruder detection, particularly highlighted by their proficiency in identifying unauthorized individuals within audio recordings or live audio streams. A key outcome of our experimental findings is the discernible superiority of the Pyannote diarization model, which demonstrated exceptional diarization performance, achieving both the lowest DER, at 14%, and the lowest JER, at 14% as well. Despite its relatively slower inference time compared to other models, the accuracy and reliability it brings to intruder detection significantly outweigh this limitation.
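The figures reported in Table 3 follow directly from the underlying confusion-matrix counts: 15 intruder files, all detected, and one of the 41 clean files falsely flagged. A minimal check:

```python
# Counts consistent with the experiment: 15 true positives (every intruder
# file detected), 1 false positive, 0 false negatives, 40 true negatives.
tp, fp, fn, tn = 15, 1, 0, 40

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 55 of 56 files correct
recall    = tp / (tp + fn)                    # no intruder missed
precision = tp / (tp + fp)                    # one false alarm
f1_score  = 2 * precision * recall / (precision + recall)

print(f"Accuracy  {accuracy:.2%}")   # 98.21%
print(f"Recall    {recall:.2%}")     # 100.00%
print(f"Precision {precision:.2%}")  # 93.75%
print(f"F1-score  {f1_score:.2%}")   # 96.77%
```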
In the development of the intruder detection system, we chose to implement the Pyannote diarization model as its core component. The performance of the system was remarkably high, demonstrating an accuracy rate of 98.21%. This high level of accuracy was further complemented by a perfect recall rate of 100.00%, indicating that every single intruder present in the dataset was successfully identified by the system. Additionally, the system exhibited a precision of 93.75%, which, although not flawless, is highly commendable. The F1-score, a balanced measure of the system’s precision and recall, stood at an impressive 96.77%, underscoring its overall efficacy.

It is noteworthy, however, that a small fraction of the speakers were incorrectly classified as intruders. This shortcoming is overshadowed by the system’s paramount accomplishment: its unfailing ability to detect every intruder included in the dataset. This aspect, above all, highlights the system’s value as a reliable tool in intruder detection scenarios.

6. Conclusions

In this paper, we conducted a comprehensive analysis that compares various deep learning models in the sphere of speaker diarization, with a particular focus on their application in detecting intruders.

The audio diarization experiment also revealed that high accuracy in detecting intruders is achievable but requires careful tuning of the system to the characteristics of each audio recording. The key factors affecting the success of identification are the “min_segment_duration” and “similarity_threshold” hyperparameters. Setting the minimum segment duration helps to avoid misidentifying intruders, although it may result in missing their short utterances. Fine-tuning the similarity threshold for embeddings, in turn, is important for accurately recognizing the voices of intruders while avoiding false positives. Attention should also be paid to the timbre of the voice, as it can significantly improve the results, especially when voices with similar characteristics are present in the recording. Thus, an individualized approach to each audio file and its features is the key to effectively detecting criminal activity in different audio contexts.

References

[1] O. Romanovskyi, et al., Prototyping Methodology of End-to-End Speech Analytics Software, in: 4th International Workshop on Modern Machine Learning Technologies and Data Science, vol. 3312 (2022) 76–86.
[2] I. Iosifov, et al., Transferability Evaluation of Speech Emotion Recognition Between Different Languages, Advances in Computer Science for Engineering and Education 134 (2022) 413–426. doi: 10.1007/978-3-031-04812-8_35.
[3] I. Iosifov, O. Iosifova, V. Sokolov, Sentence Segmentation from Unformatted Text using Language Modeling and Sequence Labeling Approaches, in: VII International Scientific and Practical Conference Problems of Infocommunications. Science and Technology (2020) 335–337. doi: 10.1109/PICST51311.2020.9468084.
[4] O. Iosifova, et al., Analysis of Automatic Speech Recognition Methods, in: Workshop on Cybersecurity Providing in Information and Telecommunication Systems, vol. 2923 (2021) 252–257.
[5] O. Iosifova, et al., Techniques Comparison for Natural Language Processing, in: 2nd International Workshop on Modern Machine Learning Technologies and Data Science, vol. 2631, no. I (2020) 57–67.
[6] V. Dudykevych, H. Mykytyn, K. Ruda, The Concept of a Deepfake Detection System of Biometric Image Modifications Based on Neural Networks, in: IEEE 3rd KhPI Week on Advanced Technology (KhPIWeek) (2022). doi: 10.1109/khpiweek57572.2022.9916378.
[7] Y. Shtefaniuk, I. Opirskyy, Comparative Analysis of the Efficiency of Modern Fake Detection Algorithms in Scope of Information Warfare, in: 11th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (2021) 207–211. doi: 10.1109/IDAACS53288.2021.9660924.
[8] X. Miro, et al., Speaker Diarization: A Review of Recent Research, IEEE Trans. Audio, Speech, Lang. Process. 20(2) (2012) 356–370. doi: 10.1109/tasl.2011.2125954.
[9] V. Khoma, et al., Development of Supervised Speaker Diarization System Based on the PyAnnote Audio Processing Library, Sensors 23(4) (2023) 2082. doi: 10.3390/s23042082.
[10] A. Hannun, et al., Deep Speech: Scaling up End-to-End Speech Recognition, arXiv preprint (2014).
[11] J. Ball, Voice Activity Detection (VAD) in Noisy Environments, ArXiv (2023).
[12] S. Cornell, et al., Overlapped Speech Detection and Speaker Counting Using Distant Microphone Arrays, Comput. Speech Lang. 72 (2022) 101306. doi: 10.1016/j.csl.2021.101306.
[13] M. Kotti, V. Moschou, C. Kotropoulos, Speaker Segmentation and Clustering, Signal Process. 88(5) (2008) 1091–1124. doi: 10.1016/j.sigpro.2007.11.017.
[14] M. Jakubec, et al., Deep Speaker Embeddings for Speaker Verification: Review and Experimental Comparison, Eng. Appl. Artif. Intell. 127 (2024) 107232. doi: 10.1016/j.engappai.2023.107232.
[15] N. Dawalatabad, et al., ECAPA-TDNN Embeddings for Speaker Diarization, Proc. Interspeech (2021) 3560–3564. doi: 10.21437/Interspeech.2021-941.
[16] D. Garcia-Romero, et al., Speaker Diarization Using Deep Neural Network Embeddings, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2017) 4930–4934. doi: 10.1109/ICASSP.2017.7953094.
[17] H. Bredin, Pyannote.Audio 2.1 Speaker Diarization Pipeline: Principle, Benchmark, and Recipe, INTERSPEECH (2023) 1983–1987. doi: 10.21437/interspeech.2023-105.
[18] E. Harper, et al., NeMo: A Toolkit for Conversational AI and Large Language Models. URL: https://github.com/NVIDIA/NeMo
[19] M. Ravanelli, et al., SpeechBrain: A General-Purpose Speech Toolkit, ArXiv (2021).
[20] J. Chung, et al., Spot the Conversation: Speaker Diarisation in the Wild, INTERSPEECH (2020) 299–303. doi: 10.21437/interspeech.2020-2337.
[21] I. Zaiets, Dataset of Ukrainian Podcasts for Intruder Detection by Voice (2024). doi: 10.57967/hf/0701.