Integrated System for Speaker Diarization and Intruder Detection using Speaker Embeddings

Illia Zaiets, Vitalii Brydinskyi, Dmytro Sabodashko, Yurii Khoma, and Khrystyna Ruda
Lviv Polytechnic National University, Bandery 12, Lviv, 79013, Ukraine

Abstract
This paper explores the use of diarization systems, which employ advanced machine learning algorithms for the precise detection and separation of different speakers in audio recordings, for the implementation of an intruder detection system. Several state-of-the-art diarization models, including NVIDIA NeMo, Pyannote, and SpeechBrain, are compared. The performance of these models is evaluated using the metrics typically applied to diarization systems: Diarization Error Rate (DER) and Jaccard Error Rate (JER). The diarization systems were tested under various audio conditions, including noisy and clean environments and small and large numbers of speakers. The findings reveal that Pyannote delivers superior diarization accuracy and was therefore used to implement the intruder detection system. This system was further evaluated on a custom dataset based on Ukrainian podcasts and achieved 100% recall and 93.75% precision: it did not miss a single criminal in the dataset, but could occasionally flag a non-criminal as a criminal. The system proves effective and flexible for intruder detection in audio files of different sizes and with different numbers of speakers.

Keywords
Deep learning, diarization, speaker embeddings, cybersecurity, intruder detection

1. Introduction

Nowadays digital technologies are changing the world around us at an incredible speed, and we are faced with a huge amount of information to process every day. This poses a challenge for many industries, especially cybersecurity and big audio data processing, where accurate and, most importantly, timely data analysis becomes a key success factor. This paper dives into this topic by proposing the development of a speech diarization system based on state-of-the-art machine learning libraries to effectively detect intruders by their voices [1–5].

To ensure the effectiveness and accuracy of the developed diarization system, the VoxConverse dataset was used. This dataset contains a wide range of audio recordings, from single speeches to complex discussions with overlapping voices, allowing the system to be tested in a variety of conditions, and is an ideal testing environment.

Particular attention was paid to how the systems handled the most common challenges in modern audio, such as noise, overlapping voices, and varying speaker volumes. It was these challenging recordings that helped us select the best library for the system.

We have developed a methodology for testing and analyzing data to compare diarization libraries, allowing us not only to assess the accuracy of recognition but also to understand the strengths and challenges of each system and how best to use them.

To evaluate the diarization libraries, we used metrics that help to objectively assess the accuracy and reliability of each library: DER and JER.

CPITS-2024: Cybersecurity Providing in Information and Telecommunication Systems, February 28, 2024, Kyiv, Ukraine
EMAIL: illia.zaiets.mkbas.2022@lpnu.ua (I. Zaiets); vitalii.a.brydinskyi@lpnu.ua (V. Brydinskyi); dmytro.v.sabodashko@lpnu.ua (D. Sabodashko); yurii.v.khoma@lpnu.ua (Y. Khoma); khrystyna.s.ruda@lpnu.ua (K. Ruda)
ORCID: 0009-0007-0754-0463 (I. Zaiets); 0000-0001-8583-9785 (V. Brydinskyi); 0000-0003-1675-0976 (D. Sabodashko); 0000-0002-4677-5392 (Y. Khoma); 0000-0001-8644-411X (K. Ruda)
© 2024 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

The results of the evaluation of the diarization libraries were used to select the best one that would be suitable for building a system capable of accurately and efficiently identifying and separating speakers in audio recordings to detect intruders. Ultimately, the Pyannote library was selected as a key element of the system, which forms part of a comprehensive multi-level information security framework [6, 7].

2. Materials and Methods

Audio diarization technology detects and distinguishes individual speakers in audio recordings. This technology is based on the complex analysis of voice data, using machine learning and deep learning algorithms to recognize the voice characteristics of each speaker and to identify the individuals involved in a conversation. This includes analyzing tone of voice, speaking speed, accents, and other unique features that distinguish one speaker from another. Its main task is to divide the audio stream into separate segments so that each segment represents the moment when one person speaks or when there is a change of speakers.

This process seeks to answer the question: "Who speaks when?" throughout the audio recording [8]. Thanks to this, the analysis of audio materials becomes much easier, especially in situations where there are many participants in a conversation and their voices often overlap or alternate. Thus, audio diarization is becoming an indispensable tool for understanding and analyzing complex audio recordings, particularly in the context of cybersecurity and other areas where speaker identification accuracy is critical [9]. Examples of how diarization is used:

• Identifying different speakers in an audio file for cybersecurity investigations, where it is important to understand who exactly participated in the conversation.
• Analyzing communications, such as intercepted phone calls or meeting notes, to help identify suspicious or malicious activity.
• Detecting fraud attempts in telephone calls, for example, by identifying inconsistencies in voices or attempts at manipulation.
• Automating the process of distributing and analyzing audio files, simplifying the work of analysts.
• Protecting the confidentiality of information by "monitoring" audio communications in large organizations to ensure that confidential information is not disclosed.

Overall, audio diarization plays an important role in cybersecurity, helping to detect and prevent fraud, crime, and other cyber threats. Such an approach plays an important role in protecting information and identifying potential threats, which is especially relevant in the context of the growing number of cyberattacks and fraudulent activities.

However, the implementation of audio diarization in the context of cybersecurity faces several challenges. One of the main ones is the presence of background noise in audio recordings, which can significantly complicate the process of speaker recognition. To solve this problem, various methods of filtering and cleaning the audio signal are used. Another important aspect is the variability of speech features, such as accents, intonations, and speech speed. This requires audio diarization systems to be highly flexible and able to adapt to a variety of conditions.

Audio diarization involves several critical steps to accurately recognize and separate speakers in audio recordings. These stages include Voice Activity Detection, Overlapped Speech Detection, Speaker Change Detection, Segmentation, Speaker Embedding Extraction, Clustering, and the Neural Diarizer. The diarization pipeline is shown in Fig. 1.

Figure 1: Diarization structure containing its main steps
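The "who speaks when" output described above is usually represented as a list of speaker-labeled time segments. The following minimal, library-agnostic sketch (the timestamps and speaker labels are invented for illustration) shows how such segments can be queried for per-speaker speaking time and speaker-change moments:

```python
# Hypothetical diarization output: (start, end, speaker) tuples in seconds,
# answering the question "who speaks when?".
segments = [
    (0.0, 4.2, "SPEAKER_00"),
    (4.2, 9.7, "SPEAKER_01"),
    (9.7, 12.1, "SPEAKER_00"),
]

def speaking_time(segments):
    """Total speaking time per speaker, in seconds."""
    totals = {}
    for start, end, speaker in segments:
        totals[speaker] = totals.get(speaker, 0.0) + (end - start)
    return totals

def change_points(segments):
    """Moments where the active speaker changes."""
    return [cur[0] for prev, cur in zip(segments, segments[1:]) if prev[2] != cur[2]]

print(speaking_time(segments))   # per-speaker totals in seconds
print(change_points(segments))   # [4.2, 9.7]
```

Real diarization libraries return richer objects, but they reduce to this structure, which is also what evaluation metrics such as DER and JER are computed from.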
229 Automatic Speech Recognition—Fig. 1 and effectively cluster audio segments demonstrates that the ASR process can be used according to speakers. in parallel with diarization, if necessary [10]. Thanks to neural networks, the diarization Voice Activity Detection—the first stage is process becomes more accurate and flexible. voice activity detection, where the system Neural diarizers can efficiently process a large determines whether a voice signal is present in a amount of audio data and are also better able certain audio segment. This enables users to to cope with challenges such as overlapping filter out quiet areas or noise, focusing solely on voices or changing recording conditions [16]. segments with voice activity such as speech [11]. The key aspect of audio diarization is the Speaker Change Detection & Overlapped use of specific machine learning algorithms. Speech Detection—the process includes Neural networks, for example, work effectively speaker change detection and detection of with the task of classifying audio segments by moments when several speakers are speaking speakers. Clustering helps in grouping similar at the same time. The system analyzes the speech features, which makes it easier to audio stream to detect the moments when one identify individual speakers. Speaker speaker finishes speaking and another starts. recognition is important for determining who This helps to divide the audio into segments, is speaking at a particular moment in a each of which reflects the speech of a particular recording. speaker. At this stage, the diarization process Data processing is also an integral part of faces several challenges. First, the quality of the process. Extracting features from audio, the audio recording can vary considerably, and such as tone, speech rate, and other noise, echo, and other audio interference can characteristics, is critical to the accuracy of the make it difficult to identify speakers. Secondly, algorithms. 
Data normalization ensures the taking into account a variety of speech homogeneity of the input data, which helps in features, including accents, dialects, and improving the accuracy of machine learning intonations, is an important aspect of ensuring models. accurate diarization. Third, voice overlap, When it comes to optimizing and tuning when multiple people speak at the same time, models, it is important to choose the right presents a challenge for accurate speaker hyperparameters to maximize the efficiency of segmentation and identification [12]. the diarization. Optimization strategies include Segmentation then segmentation takes choosing the model architecture, adjusting the place, where the audio recording is divided learning rate, and other parameters that can into smaller parts for detailed analysis. Each affect the result. segment is checked for unique features of the Last but not least, the results are analyzed speaker [13]. and validated. This includes evaluating the Speaker Embedding Extraction is one of accuracy of the algorithms on different datasets the most important stages in the extraction of and analyzing errors, which allows for further speaker characteristics. The system identifies improvement of the diarization methods. unique voice attributes, such as timbre, tempo, The tools used, including PyAnnote [17], and intonation, which allows the creation of a NVIDIA NeMo [18], and SpeechBrain [19], have unique “fingerprint” of each speaker [14]. a variety of functionalities for complex speech Clustering at this stage, clustering takes analysis, speaker identification, and speaker place, where segments with similar diarization. characteristics are organized together. This These are three of the main and most allows the system to recognize and group popular Python libraries used in diarization segments belonging to the same speaker [15]. tasks. 
PyAnnote is great for automatic audio Neural Diarizer is the final stage that annotation and speaker identification. NVIDIA involves the use of neural diarizers. Neural NeMo offers powerful tools for working with diarizers are deep learning-based systems that neural networks, which is ideal for complex can automatically identify different speakers in diarization tasks. SpeechBrain, with its complex audio recordings. They use powerful flexibility and open-source nature, is another neural networks to analyze audio signals, pick great tool for speech processing and diarization. up subtle differences between different voices, In general, the use of machine learning algorithms in audio diarization tasks is an 230 important step towards the development of In general, Pyannote is an impressive audio speech processing technologies, offering more diarization solution that continues to evolve accurate and efficient solutions. and find new applications in a variety of areas, for both research and commercial use. This 3. Aim of Research library is especially valuable for its ability to perform complex audio processing tasks, providing reliable and accurate results. The main goal of this article is to implement a system that can accurately recognize and separate the voices of speakers in audio b. Nvidia NeMo recordings and detect intruders. This is of great importance not only for information NVIDIA NeMo, which stands for Neural security, but also for other areas where it is Modules, is an innovative approach to machine important to analyze speech accurately. To learning and audio analysis. This library, achieve this goal, we analyzed the capabilities created by NVIDIA, specializes in applying of libraries such as Pyannote, NVIDIA NeMo, deep learning to a variety of speech-processing and SpeechBrain, and built a diarization tasks, including audio diarization. system capable of intruder detection based on NeMo’s special feature is its modular this analysis. 
architecture, which allows researchers and developers to easily create, customize, and optimize different components of neural 4. Models Overview networks for specific tasks. This makes NeMo a. PyAnnote not only a powerful tool for machine learning experts but also an accessible solution for a Pyannote represents an important direction in wider range of users who may not have deep the development of audio diarization knowledge in this area. algorithms. It is an open-source tool developed In the context of audio diarization, NeMo for audio data processing, especially focused uses advanced deep learning algorithms to on diarization tasks. efficiently recognize and separate speakers in The main advantage of Pyannote is its audio recordings. With its high accuracy and flexibility and high accuracy, provided by the ability to process complex audio data, NeMo is use of deep learning algorithms. It uses neural becoming an important tool in tasks that networks to analyze audio recordings, detect require recognizing different voices, even in speaker identification features or embeddings, the presence of noise or overlapping voices. and separate and classify them. This allows NVIDIA NeMo is also continuously updated Pyannote to efficiently separate audio to include the latest advances in machine recordings into segments, each corresponding learning and speech processing. This ensures to a specific speaker, even in difficult that users have access to the most advanced conditions where voices overlap or there is technologies to solve their problems. background noise. 
In summary, NVIDIA NeMo plays a In addition to diarization, Pyannote also significant role in today’s audio diarization provides tools for other audio processing process by offering flexible, scalable, and high- tasks, such as voice activity detection, and performance solutions for a variety of research gender and age recognition, making it a and commercial applications in different fields, multifunctional solution. including cybersecurity. Another important aspect of Pyannote is its community and open nature. Developers and c. SpeechBrain researchers can contribute their improvements and adaptations, which SpeechBrain is another important player in the contributes to the constant updating and field of machine learning algorithms for audio improvement of the tool. This also means that diarization. This open-source tool was developed the library is constantly adapting to new as a one-stop solution for a variety of speech- challenges and technological breakthroughs in processing tasks, including audio diarization, audio data processing. speech recognition, and speech synthesis. 231 SpeechBrain is flexible, allowing users to easily duration of speech falsely detected as non- customize and adapt the system to their specific speech in an audio file, and Tc is the duration of needs. The use of deep learning algorithms speaker confusion in an audio file. allows SpeechBrain to efficiently process Jaccard Error Rate (JER) is an error that complex audio recordings and accurately determines how often the speakers are falsely separate the speech of different speakers by detected as other speakers. This metric is returning their embeddings. based on the Jaccard index, which measures One of the advantages of SpeechBrain is its the similarity between the sets. ability to handle large amounts of data, making it Jaccard error rate can be calculated using an ideal solution for processing audio recordings the following equation: on the scale required today. 
Also important is its ∑𝑁𝑖=0|𝑆𝑖 ∩ 𝐷𝑖 | (2) ability to adapt to different recording conditions, 𝐽𝐸𝑅 = 1 − 𝑁 , ∑𝑖=0|𝑆𝑖 ∪ 𝐷𝑖 | including different languages, accents, and sound where Si is the set of the segments for a quality. speaker i in the test dataset, Di is the set of the SpeechBrain is also characterized by its segments where speaker i was predicted, and openness, which facilitates a community of N is the total number of speakers. researchers and developers to work together to improve and adapt the tool. This creates a b. Initial Analysis dynamic environment for innovation and development in the field of speech processing. Overall, SpeechBrain offers a feature-rich Before starting extensive testing on massive data solution for audio diarization tasks, providing from various Python libraries, we first conducted high accuracy, flexibility, and scalability, which is experiments with the Pyannote algorithm using important for a wide range of applications, from audio recordings with a wide range of conditions. scientific research to commercial audio This included files with different numbers of processing projects. participants, variations in noise levels, and different degrees of speech overlap. This way, we can better understand Pyannote’s performance 5. Experiment Setup and reliability in different acoustic scenarios, which is critical for further analysis of larger A high-performance NVIDIA RTX 3090 data. graphics card was chosen to effectively solve Initially, a two-minute audio file was selected the tasks of audio diarization. This choice was for analysis, which was a recording of a news made due to its high computing power and broadcast. The peculiarity of this recording was optimization for deep learning tasks, which is the high level of background noise, although critical for the efficient processing and analysis there were no overlapping audio tracks. This of large amounts of audio data. 
choice allowed us to evaluate how efficiently the algorithm can process audio with complex sound a. Metrics environment conditions, not complicated by the simultaneous speech of several speakers. Diarization Error Rate (DER)—error of The audio file with two speakers and detecting the segment’s boundaries and overlaps background noise is visualized in Fig. 2. in the audio recording considering the true or false assignment of speaker identifier to the audio recording segment. This error is the main for diarization and is to be the generally accepted metric in commercial systems. Diarization error rate can be calculated using the following equation: 𝑇𝑎 + 𝑇𝑚 + 𝑇𝑐 (1) Figure 2: Timeline of an audio file containing 𝐷𝐸𝑅 = , 𝑇 two speakers with background noise where T is the total duration of an audio file, Ta is the duration of non-speech falsely The next step in the study was a more complex detected as speech in an audio file, Tm is the task for the Pyannote library. A 16-minute 232 audio recording of a conference with 11 people Table 1 present at the same time was chosen for Initial analysis results analysis. This recording was characterized by a Conditions DER JER significant level of noise, changes in speech Noisy environment; 2 speakers 0.19 0.19 Noisy environment; 2 speakers 0.76 0.75 volume, and frequent interruptions between Clean environment; 2 speakers 0.07 0.07 speakers. This made it possible to evaluate Pyannote’s ability to effectively cope with the high level of complexity in speech recognition From the experiment results it can be seen and speaker identification in multi-voice audio. that the diarization system performed well on The audio file with eleven speakers with the smaller amount of speakers, though a bit background noise is visualized in Fig. 3. worse when the noisy environment was present. 
When it comes to diarization on the bigger number of speakers in the noisy environment, the system did not perform well, so it is not recommended to use this system with data, where there are a lot of speakers with potential overlaps and a noisy environment on top. c. Test Dataset for Model Selection The VoxConverse dataset [20] was chosen to Figure 3: Timeline of an audio file containing evaluate popular Python diarization libraries. eleven speakers VoxConverse is an extensive dataset that was created for speech diarization tasks and is a The study was continued by selecting an audio good resource for researchers and developers recording of a clean speech with no in this field. This dataset includes a large background noise, overlapping tracks, or number of audio recordings that cover a wide interruptions in speech. This recording is ~15 range of scenarios from public speeches and minutes long and represents ideal conditions interviews to newscasts and debates. A special for analysis, which makes it possible to feature of VoxConverse is the presence of evaluate the algorithm’s performance under recordings where speech overlap is observed, optimal conditions. This choice allows you to which is very typical in real-world settings and establish a baseline level of accuracy of the is of great interest for research. diarization system under ideal conditions, The audio recordings in VoxConverse are without external interference. annotated with detailed labels that include The audio file with two speakers with a time intervals and speaker identifiers. This clean background is shown in Fig. 4 with information is extremely valuable as it allows background noise. us to accurately assess how different diarization algorithms and systems perform in detecting and distributing speech among different speakers. Such annotations are important for comparing the results of diarization systems with the “ideal” and evaluating their effectiveness. 
The large amount of data in VoxConverse enables deep and comprehensive analysis. Figure 4: Timeline of an audio file containing This allows researchers to evaluate dialysis two speakers with clear audio systems in a variety of settings, including Table 1 contains the results of the initial study scenarios with a variable number of speakers, of the robustness of the diarization library different noise levels, and different speech Pyannote. styles and accents. This diversity helps to improve the reliability and accuracy of dialysis 233 systems and contributes to the development of evaluate which algorithm is the most efficient more versatile and adaptive solutions. in this parameter. This analysis will not only Thanks to its openness and accessibility, help identify the most accurate voice VoxConverse has become a valuable tool for recognition system but also determine which the community to conduct collaborative one provides the best ratio of speed and quality research and development in the field of of data processing. speech diarization. The use of such datasets The model selection experiment results are helps researchers identify new challenges that presented in Table 2. modern systems may face and develop more Table 2 efficient algorithms for speech processing. Model selection experiment results Out of the entire VoxConverse dataset of Model Elapsed time DER JER 464 records, the first 50 records of the dataset SpeechBrain 3m 53s 0.31 0.31 were selected for the tests to reduce the time NVIDIA NeMo 17m 32s 0.41 0.41 to perform the diarization and reduce resource Pyannote 20m 7s 0.14 0.14 usage. These selected records have different lengths, ranging from 3 to 20 minutes, which SpeechBrain, an open-source machine provides a wide range of conditions to evaluate learning library, impressed with its processing the performance of my chosen Python speed, taking only 3 minutes and 53 seconds, machine-learning libraries. 
Not only does this although accuracy leaves much to be desired approach allow for a focus on detail, but it also and additional tuning is required to achieve provides practical relevance by demonstrating optimal results. The average DER of 31% can how systems adapt to variability in real-world be considered satisfactory, given the openness speech scenarios. This helps to gain a deeper of the library and the complexity of the data. understanding of each system's performance NVIDIA’s NeMo took longer to process—17 in situations that may occur in real life and minutes and 32 seconds—but showed good identify potential areas for further diarization results, especially given the improvement. complexity of the audio data. The average DER of 14% indicates the efficiency of the d. Model Selection for Intruder algorithm. Detection System For the intruder detection system, it was decided to use Pyannote, which showed the For each of the selected libraries, we developed best results in diarization of this dataset. With the appropriate code, taking into account their an average DER of 9%, Pyannote effectively unique features. The goal was to ensure that handles the challenges of the dataset. Despite the final result in each of them complied with the fact that Pyannote’s processing time was the generally accepted RTTM standard for 20 minutes and 7 seconds, this is compensated diarization timestamps. This methodology by its high accuracy and high-quality provided the ability to equally evaluate and documentation, which allows users to quickly compare the results obtained using different get started with the library. Although the diarization systems. processing time is not a decisive factor As part of the experiment, we used models compared to NeMo, the time to implement and of these libraries trained on the VoxConverse configure Pyannote was significantly shorter. dataset to evaluate their effectiveness in real- Thus, given the speed, accuracy, and ease of world conditions. 
The main goal of this use, Pyannote was the choice for the final task, experiment is to determine which of these demonstrating an excellent balance between libraries is best suited for the final task of processing time and quality of results. detecting an intruder. The following metrics were chosen to evaluate the performance of each library: average DER and JER. These metrics were calculated based on 50 selected test recordings from VoxConverse. We also took into account the diarization time for each system to 234 e. Diarization for Intruder Detection Intruder Detection System Task Implementation Data Preparation Moving forward, the subsequent phase in our research involves the creation of a bespoke Before developing and analyzing an intruder algorithm. This algorithm is tailored detection system, it is necessary to collect and specifically to extract embeddings from audio prepare data that will be used to train and test recordings that contain the vocal patterns of the model. This stage involves selecting individuals labeled as “intruders”. The core appropriate audio recordings and processing process of this development entails the them to ensure effective training of the system. transformation of the distinctive vocal After creating a reliable and representative characteristics of each speaker into complex, training sample, the next step is to develop a high-dimensional numerical vectors. These method for detecting and identifying potential vectors are a crucial element as they criminals in the database. The use of various encapsulate the unique voice features in a methods of speech diarization will allow us to quantifiable form. test the effectiveness of the system and ensure The strategic utilization of these voice its practical use in real-life scenarios. embeddings plays a vital role in our study. It For the study, several episodes of a well- enables a more refined and in-depth known Ukrainian YouTube podcast were comparison and analytical process. 
This is selected, where two hosts are constantly achieved by measuring the cosine distances participating and different guests come to each between these numerical vectors. By analyzing episode. In the experiment, some guests were these distances, we can ascertain with a high conditionally labeled as “intruders”. Five degree of precision whether a particular separate three-minute audio recordings were segment of speech can be attributed to a created for each guest to extract their voice specific “criminal” or another speaker. This embeddings. The total size of the dataset for methodology is highly effective in identifying “intruders” is 56 recordings, each of distinguishing between different voices in an which is two to three minutes long. Out of this audio recording. This approach is key to the number, 15 recordings include the voices of development of systems used in forensic the identified “intruders”, while the remaining research and other areas where it is necessary 41 do not. This provides a unique opportunity to accurately identify a person by voice. to evaluate how well the developed diarization The next step in the research is to apply an system performs in recognizing and separating algorithm to collect embeddings from all voices in real-life situations, which is key for suspect recordings in the database. This application in practical scenarios. process involves analyzing each audio file and The dataset used for this experiment can be extracting the corresponding embeddings. found here [21]. Once the embeddings are collected, they are The example of a prepared audio file for clustered. This procedure allows you to group intruder detection is shown in Fig. 5. similar voice characteristics, which is key to simplifying the subsequent identification process. Clustering reduces the need to make multiple comparisons between each segment’s echo and all of the offender’s echoes, thereby increasing the efficiency and accuracy of identification. 
In addition, clustering helps to identify common characteristics of the voices of the “intruders”, which can help to accurately identify potential suspects.

Figure 5: Example of an audio file containing an intruder’s voice

The processing of podcasts includes downloading each audio file, running diarization on it, and then selecting only the segments longer than 5 seconds. This is done to ensure detailed analysis and accurate identification of the different voices in the recording. Each segment is then checked against the intruder speech patterns that are predefined and stored in a database. This technique makes it possible to accurately pinpoint the moments of the suspect’s presence in the audio material and also to identify intruders who speak only in certain parts of the podcast, which is important for ensuring high identification accuracy without affecting processing speed.

Before comparing segments, a key threshold parameter must be specified, as it can dramatically change the results of the study. The threshold plays a crucial role in determining whether a voice in a podcast segment matches the voice of a known intruder. It serves as a measure of the level of similarity between voice embeddings. The key point is that if the cosine distance between the segment’s embedding and the nearest intruder’s embedding is less than this threshold, the system recognizes the presence of an intruder in that segment.

The intruder detection system is shown in Fig. 6.

Figure 6: Intruder detection system

Intruder Detection System Experiment

After the development of the main components of the system is completed, the next stage is its launch and testing on the selected dataset. This allows us to evaluate the functionality and efficiency of the developed system in real conditions. An important part of this process is the analysis of the results, which will help identify the strengths and weaknesses of the program, as well as possible areas for further improvement. Testing on a dataset will not only confirm the program’s ability to effectively recognize intruder voices but will also provide valuable insight into its overall accuracy and reliability in various use cases.

During the meticulous analysis of the dataset, the algorithm demonstrated a remarkable level of accuracy in identification tasks. Among the entirety of the audio files that were processed, only a single file was erroneously classified as containing the voice of a criminal. This outcome may indicate limitations inherent in the algorithm itself, or it could point to specific characteristics of the audio file that influenced its recognition. Significantly, all other files within the dataset were identified with a high degree of accuracy, a fact that robustly affirms the effectiveness and reliability of the system we have developed.

Moreover, this solitary instance of misidentification, while an outlier, is of considerable value. It offers critical insights and catalyzes further in-depth analysis and fine-tuning of the algorithm. By closely examining this case, we can gain a deeper understanding of the algorithm’s current capabilities and limitations. This understanding is instrumental in guiding subsequent enhancements and optimizations. Our goal is to refine the algorithm’s precision in distinguishing between the presence of intruders and non-intruders across a diverse range of audio scenarios. This ongoing process of improvement is pivotal to ensuring that the system remains highly efficient and effective in various real-world applications.

Figure 7: Intruder detection experiment confusion matrix

Drawing upon the data derived from the confusion matrix, as illustrated in Figure 7, we can compute several crucial algorithm performance metrics.
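As an illustrative sketch of how the per-file decisions behind such a confusion matrix could be produced, the snippet below combines the two rules described earlier: discard segments shorter than the minimum duration, and flag a segment whose cosine distance to the nearest intruder embedding falls below the threshold. All function names, toy embeddings, and default values are hypothetical, not taken from the published system:

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) *
                        math.sqrt(sum(x * x for x in b)))

def file_contains_intruder(segments, intruder_embs,
                           min_segment_duration=5.0,
                           similarity_threshold=0.5):
    """A file is flagged if any sufficiently long segment lies within
    `similarity_threshold` cosine distance of an intruder embedding.
    `segments` is a list of (duration_seconds, embedding) pairs."""
    for duration, emb in segments:
        if duration < min_segment_duration:
            continue  # short segments give unreliable embeddings
        if min(cosine_distance(emb, ref) for ref in intruder_embs) < similarity_threshold:
            return True
    return False

def confusion_counts(decisions, labels):
    """Tally per-file (predicted, actual) pairs into TP/FP/FN/TN counts."""
    tp = sum(p and a for p, a in zip(decisions, labels))
    fp = sum(p and not a for p, a in zip(decisions, labels))
    fn = sum((not p) and a for p, a in zip(decisions, labels))
    tn = sum((not p) and (not a) for p, a in zip(decisions, labels))
    return tp, fp, fn, tn

# One long matching segment flags the file; a short match alone does not.
flagged = file_contains_intruder([(6.0, [0.95, 0.05])], [[1.0, 0.0]])
```

Comparing these per-file decisions against ground-truth labels yields the four confusion-matrix cells from which the metrics below are computed.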
Notable among these are Accuracy, Precision, Recall, and the F1-Score. These metrics are indispensable, as they furnish insightful details regarding the model’s proficiency in precisely detecting intruders. Furthermore, they shed light on the model’s dependability in minimizing instances of false positives and false negatives. The attainment of high values in these metrics is a clear indicator that the system we have developed is highly competent in its designated functions: it identifies intruders with remarkable accuracy and is characterized by a minimal occurrence of errors. This aspect of the system’s performance is not only a testament to its effectiveness but also highlights its reliability in critical situations where the accurate detection of intruders is paramount.

Table 3
Intruder detection experiment results

Accuracy, %   Recall, %   Precision, %   F1-score, %
98.21         100.00      93.75          96.77

We evaluated the resilience and effectiveness of the compared diarization systems across a spectrum of environmental conditions. Central to our study is the development of an intruder detection system that is fundamentally based on the principles and technology of speaker diarization. The results of our investigation reveal a notable suitability of diarization models for intruder detection, particularly highlighted by their proficiency in identifying unauthorized individuals within audio recordings or live audio streams. A key outcome of our experimental findings is the discernible superiority of the Pyannote diarization model, which demonstrated exceptional diarization performance, achieving both the lowest DER, at 14%, and the lowest JER, at 14% as well. Despite its relatively slower inference time compared to other models, the accuracy and reliability it brings to intruder detection significantly outweigh this limitation.
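The figures reported in Table 3 follow directly from the underlying confusion-matrix counts: 15 intruder files, all detected, and one of the 41 clean files falsely flagged. A minimal check:

```python
# Counts consistent with the experiment: 15 true positives (every intruder
# file detected), 1 false positive, 0 false negatives, 40 true negatives.
tp, fp, fn, tn = 15, 1, 0, 40

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 55 of 56 files correct
recall    = tp / (tp + fn)                    # no intruder missed
precision = tp / (tp + fp)                    # one false alarm
f1_score  = 2 * precision * recall / (precision + recall)

print(f"Accuracy  {accuracy:.2%}")   # 98.21%
print(f"Recall    {recall:.2%}")     # 100.00%
print(f"Precision {precision:.2%}")  # 93.75%
print(f"F1-score  {f1_score:.2%}")   # 96.77%
```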
In the development of the intruder detection system, we chose to implement the Pyannote diarization model as its core component. The performance of the system was remarkably high, demonstrating an accuracy rate of 98.21%. This high level of accuracy was further complemented by a perfect recall rate of 100.00%, indicating that every single intruder present in the dataset was successfully identified by the system. Additionally, the system exhibited a precision of 93.75%, which, although not flawless, is highly commendable. The F1-score, a balanced measure of the system’s precision and recall, stood at an impressive 96.77%, underscoring its overall efficacy.

It is noteworthy, however, that a small fraction of the speakers were incorrectly classified as intruders. This shortcoming is overshadowed by the system’s paramount accomplishment: its unfailing ability to detect every intruder included in the dataset. This aspect, above all, highlights the system’s value as a reliable tool in intruder detection scenarios.

6. Conclusions

In this paper, we conducted a comprehensive analysis that compares various deep learning models in the sphere of speaker diarization, with a particular focus on their application in detecting intruders.

The audio diarization experiment also revealed that high accuracy in detecting intruders is achievable but requires careful tuning of the system to the characteristics of each audio recording. The key factors affecting the success of identification are the “min_segment_duration” and “similarity_threshold” hyperparameters. Setting the minimum segment duration helps to avoid misidentifying intruders, although it may result in missing their short utterances. Fine-tuning the similarity threshold for embeddings, in turn, is important for accurately recognizing the voices of intruders while avoiding false positives. Attention should also be paid to the timbre of the voice, as it can significantly improve the results, especially when voices with similar characteristics are present in the recording. Thus, an individualized approach to each audio file and its features is the key to effectively detecting criminal activity in different audio contexts.

References

[1] O. Romanovskyi, et al., Prototyping Methodology of End-to-End Speech Analytics Software, in: 4th International Workshop on Modern Machine Learning Technologies and Data Science, vol. 3312 (2022) 76–86.
[2] I. Iosifov, et al., Transferability Evaluation of Speech Emotion Recognition Between Different Languages, Advances in Computer Science for Engineering and Education 134 (2022) 413–426. doi: 10.1007/978-3-031-04812-8_35.
[3] I. Iosifov, O. Iosifova, V. Sokolov, Sentence Segmentation from Unformatted Text using Language Modeling and Sequence Labeling Approaches, in: VII International Scientific and Practical Conference Problems of Infocommunications. Science and Technology (2020) 335–337. doi: 10.1109/PICST51311.2020.9468084.
[4] O. Iosifova, et al., Analysis of Automatic Speech Recognition Methods, in: Workshop on Cybersecurity Providing in Information and Telecommunication Systems, vol. 2923 (2021) 252–257.
[5] O. Iosifova, et al., Techniques Comparison for Natural Language Processing, in: 2nd International Workshop on Modern Machine Learning Technologies and Data Science, vol. 2631, no. I (2020) 57–67.
[6] V. Dudykevych, H. Mykytyn, K. Ruda, The Concept of a Deepfake Detection System of Biometric Image Modifications Based on Neural Networks, in: IEEE 3rd KhPI Week on Advanced Technology (KhPIWeek) (2022). doi: 10.1109/khpiweek57572.2022.9916378.
[7] Y. Shtefaniuk, I. Opirskyy, Comparative Analysis of the Efficiency of Modern Fake Detection Algorithms in Scope of Information Warfare, in: 11th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (2021) 207–211. doi: 10.1109/IDAACS53288.2021.9660924.
[8] X. Miro, et al., Speaker Diarization: A Review of Recent Research, IEEE Trans. Audio, Speech, Lang. Process. 20(2) (2012) 356–370. doi: 10.1109/tasl.2011.2125954.
[9] V. Khoma, et al., Development of Supervised Speaker Diarization System Based on the PyAnnote Audio Processing Library, Sensors 23(4) (2023) 2082. doi: 10.3390/s23042082.
[10] A. Hannun, et al., Deep Speech: Scaling up End-to-End Speech Recognition, arXiv preprint (2014).
[11] J. Ball, Voice Activity Detection (VAD) in Noisy Environments, ArXiv (2023).
[12] S. Cornell, et al., Overlapped Speech Detection and Speaker Counting Using Distant Microphone Arrays, Comput. Speech Lang. 72 (2022) 101306. doi: 10.1016/j.csl.2021.101306.
[13] M. Kotti, V. Moschou, C. Kotropoulos, Speaker Segmentation and Clustering, Signal Process. 88(5) (2008) 1091–1124. doi: 10.1016/j.sigpro.2007.11.017.
[14] M. Jakubec, et al., Deep Speaker Embeddings for Speaker Verification: Review and Experimental Comparison, Eng. Appl. Artif. Intell. 127 (2024) 107232. doi: 10.1016/j.engappai.2023.107232.
[15] N. Dawalatabad, et al., ECAPA-TDNN Embeddings for Speaker Diarization, Proc. Interspeech (2021) 3560–3564. doi: 10.21437/Interspeech.2021-941.
[16] D. Garcia-Romero, et al., Speaker Diarization Using Deep Neural Network Embeddings, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2017) 4930–4934. doi: 10.1109/ICASSP.2017.7953094.
[17] H. Bredin, Pyannote.Audio 2.1 Speaker Diarization Pipeline: Principle, Benchmark, and Recipe, INTERSPEECH (2023) 1983–1987. doi: 10.21437/interspeech.2023-105.
[18] E. Harper, et al., NeMo: A Toolkit for Conversational AI and Large Language Models. URL: https://github.com/NVIDIA/NeMo
[19] M. Ravanelli, et al., SpeechBrain: A General-Purpose Speech Toolkit, ArXiv (2021).
[20] J. Chung, et al., Spot the Conversation: Speaker Diarisation in the Wild, INTERSPEECH (2020) 299–303. doi: 10.21437/interspeech.2020-2337.
[21] I. Zaiets, Dataset of Ukrainian Podcasts for Intruder Detection by Voice (2024). doi: 10.57967/hf/0701.