<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maksym Kizitskyi</string-name>
          <email>maksym.kizitskyi@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olena Turuta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleksii Turuta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>1</label>
          <institution>Kharkiv National University of Radio Electronics</institution>
          ,
          <addr-line>14 Nauky Ave., Kharkiv, 61166</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Speaker verification</institution>
          ,
          <addr-line>ConvNext, DOLG Architecture, Multilingual Training</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Speaker verification is an essential task in speech processing. In this work the task is implemented using convolutional neural networks. Several key metrics were evaluated, including equal error rate and precision top-K, and the performance of different architectures and loss functions was compared. The experiments are conducted using a Ukrainian dataset and include comparisons of models trained on multilingual data, as well as models trained on clean and augmented data. The results are presented in tables and figures, showing that even for low-resource languages the models can achieve good performance metrics. The authors also discuss the implications of their findings and the potential for transferring skills to other languages. The paper provides valuable insights for researchers working in the field of speaker verification.</p>
      </abstract>
      <kwd-group>
        <kwd>Speaker verification</kwd>
        <kwd>ConvNext</kwd>
        <kwd>DOLG Architecture</kwd>
        <kwd>Multilingual Training</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In today's digital age, speech recognition and speaker verification techniques have become
increasingly important for a variety of applications. These technologies have revolutionized the way we
interact with machines, allowing for seamless communication and automation in various fields, from
personal assistants to security systems. Speech recognition refers to the ability of machines to identify
and transcribe human speech, while speaker verification focuses on verifying the identity of the person
speaking. Both technologies have numerous practical applications, including improving accessibility
for individuals with disabilities, enhancing the user experience of devices and applications, and
enhancing security measures in industries such as banking and finance. Thus, understanding the
importance and potential of speech recognition and speaker verification is crucial for those interested
in the future of technology and its impact on society.</p>
      <p>
        Despite the vast potential of speech recognition and speaker verification technologies, there are still
significant challenges in implementing them for low-resource languages [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] like Ukrainian. Many of
these languages lack the necessary data and resources to develop robust and accurate models. However,
recent advancements in machine learning, particularly in deep learning techniques, have made it
possible to overcome some of these limitations and enable the development of speech and speaker
recognition models for these languages.
      </p>
      <p>The potential impact of these technologies on low-resource languages is immense. Speech
recognition can greatly improve accessibility for individuals who speak these languages, allowing them
to communicate more effectively with technology and access a wider range of digital content. Speaker
verification can also enhance security measures in industries like finance and government, enabling
secure authentication of individuals who speak these languages.</p>
      <p>Furthermore, the development of speech recognition and speaker verification models for low-resource languages can have broader socio-economic benefits. For example, it can improve the efficiency and accuracy of customer service for businesses operating in these regions, increasing customer satisfaction and loyalty. It can also facilitate the development of new tools and applications that are specifically tailored to the needs of these populations, enhancing their digital literacy and participation in the global digital economy.</p>
      <p>The aim of the work is to develop an approach that performs highly accurate speaker verification (comparable with the performance of SOTA models for resource-rich languages) for low-resource languages like Ukrainian.</p>
      <p>So the goals of this work are to:
• Develop a robust speaker verification system
• Study the effectiveness of transferring speaker verification skills from other languages
• Compare the effectiveness of different approaches and algorithms</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>Speaker verification is the process of verifying the identity of a person based on their voice. This process is often used in security systems, access control and other applications where identification is required. However, speaker verification systems are typically designed for high-resource languages, leaving low-resource languages with limited options. In this literature review we explore the state-of-the-art research on speaker verification for low-resource languages.</p>
      <p>In recent years, researchers have attempted to address the issue of speaker verification for low
resource languages by developing systems that are capable of identifying individuals who speak less
common languages. These efforts have been driven by the need to ensure that all people, regardless of
their language, can have access to secure and reliable identification systems.</p>
      <p>
        One approach that has been used to overcome the lack of resources for low resource languages is
data augmentation. This technique involves creating new data from existing data by applying various
transformations such as pitch shifting, noise addition and speed variation. In a study by Chen et al.
(2021) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the authors proposed a data augmentation method for speaker verification in low resource
languages using a combination of noise addition, reverberation and pitch shifting. The authors reported
that their proposed method outperformed the baseline approach, which only used the original data.
      </p>
      <p>
        Another approach that has been explored is transfer learning, which involves training a model on a
resource-rich language and then fine-tuning it for a low resource language. In a study by Sigtia et al.
(2018) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the authors proposed a transfer learning method for speech recognition in Swahili, a low
resource language spoken in East Africa. The authors trained a deep neural network (DNN) on a large
dataset of English speech and then fine-tuned the model on a smaller dataset of Swahili speech. The
authors reported that their proposed method outperformed the baseline approach, which only used the
small dataset of Swahili speech.
      </p>
      <p>
        In addition to data augmentation and transfer learning, other approaches have also been explored,
such as unsupervised speaker adaptation and speaker diarization. Unsupervised speaker adaptation
involves adapting a pre-trained model to a new speaker without requiring any labeled data. In a study
by Gautam et al. (2019) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the authors proposed an unsupervised speaker adaptation method for
speaker verification in Hindi, a low resource language spoken in India. The authors reported that their
proposed method outperformed the baseline approach, which required labeled data.
      </p>
      <p>In conclusion, the research on speaker verification for low resource languages is an emerging area
of study, and several approaches have been proposed to address this issue. Data augmentation, transfer
learning, unsupervised speaker adaptation and speaker diarization are some of the approaches that have
been explored. While these approaches have shown promise, there is still much work to be done to
develop accurate and reliable speaker verification systems for low resource languages.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods and materials</title>
      <p>This section describes the data used in the following experiments, as well as the other materials and methods proposed to solve the problem under consideration.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1. Dataset Description</title>
      <p>
        As a base dataset we have chosen the Common Voice dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. It is a crowd-sourced dataset that contains a large number of audio recordings from different speakers, even for low-resource languages like Ukrainian. A large number of speakers is essential for building a robust speaker verification system. The Ukrainian dataset contains 73 hours of recordings of 120 speakers in the train split and 14 hours of 639 speakers in the test split.
      </p>
      <p>In some experiments we additionally use datasets in other languages (language 1 and language 2) in the training process. The duration and number of unique speakers are presented in Table 1.</p>
      <p>In order not to overfit on speakers with a small number of recordings, we dropped speakers with fewer than 40 recordings from the training dataset. Figure 1 shows the histograms of the number of recordings per speaker.</p>
      <p>Also, in some experiments we limit the number of recordings per speaker for the Ukrainian recordings in order to make the dataset balanced.</p>
      <p>As a feature extraction step, we split each audio into 3-second chunks and extracted a spectrogram from each of them. After that we normalized the spectrograms, and from this step on we could process them like images.</p>
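      <p>A minimal sketch of this preprocessing step, assuming torchaudio; the file name and the mel parameters (n_fft, n_mels) are illustrative assumptions rather than the exact values of our pipeline.</p>
      <preformat>
import torchaudio

# Load a recording and down-mix to mono; the file name is hypothetical.
wav, sr = torchaudio.load("speaker_0001.wav")
wav = wav.mean(dim=0, keepdim=True)

# Split into 3-second chunks, as described above.
chunk_len = 3 * sr
usable = wav.shape[1] // chunk_len * chunk_len
chunks = wav[:, :usable].split(chunk_len, dim=1)

# Mel spectrogram extraction; n_fft and n_mels are assumptions.
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_fft=1024, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB()

specs = []
for c in chunks:
    s = to_db(to_mel(c))                   # shape (1, n_mels, frames), image-like
    s = (s - s.mean()) / (s.std() + 1e-6)  # per-chunk normalization
    specs.append(s)
      </preformat>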
      <p>During the training process in some experiments, we applied mel spectrogram augmentations such as time and frequency masking. This was done in order to prevent overfitting and make the model robust to real-world data.</p>
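      <p>A hedged sketch of these augmentations using the corresponding torchaudio transforms; the mask widths are illustrative assumptions, not our exact settings.</p>
      <preformat>
import torchaudio

freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=16)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=32)

def augment(spec):
    # Zero out a random band of mel bins and a random span of time frames;
    # applied only during training.
    return time_mask(freq_mask(spec))
      </preformat>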
      <p>In order to evaluate the model on real-world data, we additionally collected recordings of interviews, department meetings in Google Meet, etc.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2. Methods</title>
      <p>We have chosen the following key metrics:
1) Equal error rate (EER) – one of the most widely used metrics for evaluating speaker verification models; the error rate at the operating point where the false acceptance rate equals the false rejection rate.
2) Precision Top-K – the fraction of examples among the Top-K data points most similar to the original one that belong to the same class (a sketch of both metrics is given after this list). In our experiments we used K equal to 3, 5 and 10.
3) Mean and standard deviation of positive similarity – mean and standard deviation of cosine similarity between examples of the same class.
4) Mean and standard deviation of negative similarity – mean and standard deviation of cosine similarity between examples that do not belong to the same class as the original data point.</p>
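      <p>A minimal sketch of the first two metrics, assuming cosine-similarity scores; scikit-learn's ROC curve is used to locate the equal error rate.</p>
      <preformat>
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    # labels: 1 for same-speaker trials, 0 for different-speaker trials
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))    # operating point where FAR = FRR
    return (fpr[i] + fnr[i]) / 2

def precision_at_k(embeddings, speaker_ids, k=3):
    # Fraction of the k most similar data points that share the query's
    # speaker id, averaged over all queries.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)         # exclude the query itself
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = speaker_ids[topk] == speaker_ids[:, None]
    return hits.mean()
      </preformat>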
    </sec>
    <sec id="sec-6">
      <title>4. Experiment</title>
      <p>
        We have chosen ConvNext [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] as a backbone because it is one of the best-performing convolutional neural network architectures in computer vision tasks such as ImageNet classification. We used randomly initialized weights, because mel spectrograms are completely different from the datasets the network was pre-trained on, so it is unlikely that pre-training would give an advantage in the task of speaker verification. Because of limitations in computational resources we only tried the small and tiny versions of it.
      </p>
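      <p>A minimal sketch, not our exact training code, of turning a randomly initialized ConvNext tiny into an embedding network; the single-channel stem and the 512-dimensional embedding size are illustrative assumptions.</p>
      <preformat>
import torch
import torch.nn as nn
from torchvision.models import convnext_tiny

model = convnext_tiny(weights=None)        # randomly initialized, as described
# Accept 1-channel mel spectrograms instead of 3-channel images.
model.features[0][0] = nn.Conv2d(1, 96, kernel_size=4, stride=4)
# Replace the classification head with an embedding head.
model.classifier[2] = nn.Linear(768, 512)

spec = torch.randn(8, 1, 128, 188)         # batch of normalized mel spectrograms
emb = nn.functional.normalize(model(spec), dim=1)  # L2-normalized embeddings
      </preformat>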
      <p>
        In order to improve the model we experimented with the DOLG architecture [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which showed SOTA results in face recognition and image retrieval. It originally used ResNet as a backbone, so we had to adapt it to our task with ConvNext as the backbone.
      </p>
      <p>The experiments are organized as follows:
1) Compare the performance of the ConvNext tiny and ConvNext small backbones.
2) Compare the performance of the best backbone from the previous experiment trained with different loss functions: Triplet loss [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], ArcFace loss [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Sub-center ArcFace loss [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], ArcFace loss + Triplet loss, Sub-center ArcFace loss + Triplet loss (a sketch of these losses is given after this list).
3) Compare the performance of the best architecture from the previous experiments when training on a large dataset that includes other languages. Validation is performed only on the Ukrainian dataset. This experiment helps to determine the possibility of transferring skills from other languages in the task of speaker verification. Since the size of the dataset is increased, the network is trained for only 4 epochs.
4) Compare the performance of the best model from the previous experiment with the same model as a backbone in the DOLG architecture. This experiment helps to identify the possibility of applying the DOLG architecture to the task of speaker verification.
5) Compare the performance of the model from the previous experiments trained on clean data and on augmented data. This experiment aims to determine how the usage of augmented data affects the training process.
      </p>
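      <p>
        A minimal sketch of these objectives, built with the PyTorch Metric Learning library [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]; the number of classes, the embedding size and the equal weighting of the combined variant are illustrative assumptions rather than our exact training settings.
      </p>
      <preformat>
from pytorch_metric_learning import losses

num_classes, emb_size = 120, 512   # e.g. the 120 Ukrainian train speakers
arcface = losses.ArcFaceLoss(num_classes=num_classes, embedding_size=emb_size)
subcenter = losses.SubCenterArcFaceLoss(num_classes=num_classes, embedding_size=emb_size)
triplet = losses.TripletMarginLoss(margin=0.2)

def subcenter_plus_triplet(embeddings, labels):
    # One of the combined variants; ArcFace-style losses keep trainable
    # class weights, so their parameters must also be passed to the optimizer.
    return subcenter(embeddings, labels) + triplet(embeddings, labels)
      </preformat>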
      <p>After all of these experiments we performed speaker diarization on our dataset using the best model from the previous experiments. To achieve this, we split the audio into parts of 3 seconds, transformed them into mel spectrograms and obtained embeddings with our model. Then the embeddings were clustered using the KMeans algorithm.</p>
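      <p>A minimal sketch of this diarization step; chunk_embeddings stands for the per-window embeddings produced by the model, and the number of speakers is assumed to be known in advance.</p>
      <preformat>
import numpy as np
from sklearn.cluster import KMeans

def diarize(chunk_embeddings, n_speakers=2):
    # Each predicted cluster id is treated as the speaker of one
    # 3-second window.
    km = KMeans(n_clusters=n_speakers, n_init=10, random_state=0)
    return km.fit_predict(chunk_embeddings)
      </preformat>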
      <p>Training was carried out in the Kaggle environment using a P100 GPU.</p>
    </sec>
    <sec id="sec-7">
      <title>5. Results</title>
    </sec>
    <sec id="sec-8">
      <title>5.1. ML Results</title>
      <p>The results of the experiments are shown in Figures 4–6 and in Table 2. All the graphs are shown in Appendix A.</p>
      <p>Figure 6 shows the change of loss and precision at top 3 during the fourth and fifth experiments. The DOLG architecture shows better initial performance and better performance in general. Also, data augmentations slightly improved the model's performance and robustness to new data.</p>
      <p>Table 2: Evaluation metrics for the compared configurations, including convnext_tiny sub center, convnext_small and convnext_tiny arcface (the first column appears to be the final loss value).</p>
      <preformat>
loss       mean_neg  mean_pos  std_neg   std_pos   eer_mean
4.779827   0.006445  0.512067  0.06615   0.191957  0.05745
2.266273   0.143715  0.621793  0.136854  0.188838  0.120028
1.056141   0.098119  0.543737  0.112699  0.226304  0.130431
0.819421   0.012565  0.559775  0.122927  0.183735  0.068941
20.94974   0.064958  0.634587  0.18114   0.201048  0.116371
12.71726   0.049018  0.682394  0.205036  0.192486  0.105805
      </preformat>
      <p>In order to test model performance on real-world data, we performed speaker diarization of a Google Meet call between 2 speakers. First of all, we split the audio into windows of 3 seconds each. Next we transformed the raw audio into mel spectrograms and extracted embeddings using our model. These embeddings were clustered using the KMeans algorithm. The results of clustering are shown in Figure 7.</p>
      <p>As we can see from the plots, there are 2 large clusters which represent the speakers. The boundary region between the clusters represents fragments where both speakers are active.</p>
      <p>As a next step, each embedding was matched with the corresponding timestamp. The result was formatted according to the SRT format and is shown in Figure 8.</p>
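      <p>A hedged sketch of this formatting step; the 3-second window length matches our pipeline, while the to_srt helper and the speaker naming are illustrative assumptions.</p>
      <preformat>
def to_srt(labels, window_s=3):
    # Format per-window cluster labels as SRT subtitle entries.
    def ts(sec):
        return f"{sec // 3600:02d}:{sec % 3600 // 60:02d}:{sec % 60:02d},000"
    entries = []
    for i, lab in enumerate(labels):
        start, end = i * window_s, (i + 1) * window_s
        entries.append(f"{i + 1}\n{ts(start)} --> {ts(end)}\nSpeaker {lab}\n")
    return "\n".join(entries)
      </preformat>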
      <p>In conclusion, the model trained for speaker verification showed good results in the task of speaker diarization on real-world data.</p>
    </sec>
    <sec id="sec-9">
      <title>6. Discussions</title>
      <p>As a result of the first experiment it was shown that even for low-resource languages the models can achieve quite good performance metrics. The results of both ConvNext tiny and ConvNext small are quite similar. For both networks we can see that after the 6th epoch the precision at N starts to decrease or stays approximately the same, which may indicate overfitting of the networks. Also, after the 6th epoch the negative standard deviation reached a plateau and no longer decreased as fast as before. On the other hand, the standard deviation of positive examples constantly increased over training. We decided to use ConvNext tiny because it has fewer parameters, so the following experiments could be performed faster. The question of the performance of larger networks (like base or large) remains open; they may perform better in the task of speaker verification.</p>
      <p>In the second experiment we compared different loss functions. In the end, all networks performed approximately the same, but the losses that contain triplet loss performed a bit worse than ArcFace and Sub-center ArcFace: they reached a plateau faster and converged more slowly. In general, all the metrics follow the same trend as in the previous experiment. We have chosen Sub-center ArcFace because it shows more robustness to new data while keeping good performance metrics.</p>
      <p>In the third experiment we compared the model trained on only one language with models trained on a multilingual dataset. The multilingual models show superior metrics on the Ukrainian test set and achieve better results in general. However, the model trained on the full multilingual dataset reached a plateau faster than the one trained on the balanced dataset (where the number of recordings per speaker is approximately the same as in the target language), which may indicate overfitting to languages with more speakers. So transferring skills from other languages to low-resource languages like Ukrainian is quite effective, but in order to achieve better results the dataset should be balanced.</p>
      <p>In the fourth experiment we compared ConvNext with the DOLG pipeline using ConvNext as a backbone on the balanced multilingual dataset. DOLG shows superior results and comes close to SOTA results in the task of speaker verification. In addition, it was trained for only 6 epochs, so it may be possible to achieve better results with further training.</p>
      <p>In the fifth experiment we applied augmentations to the spectrograms and repeated the previous experiment. As a result, the model achieved an even better level of performance and robustness.</p>
      <p>Next we used the model from the last experiment to analyze real-world data: a Google Meet call between 2 people. It performed quite well; however, when there was no sound, the model could treat the silence as a separate speaker. In conclusion, we recommend using ConvNext tiny as a backbone in the DOLG pipeline to achieve SOTA results.</p>
    </sec>
    <sec id="sec-10">
      <title>7. Conclusions</title>
      <p>This paper presents a study on speaker verification using deep learning models. The study used four key metrics to evaluate the performance of the models: equal error rate, precision top-K, mean and standard deviation of positive similarity, and mean and standard deviation of negative similarity. The study compared the performance of different network architectures, such as ConvNext and DOLG, and different loss functions, such as Triplet loss, ArcFace, and Sub-center ArcFace. The study also compared the performance of models trained on single-language datasets and those trained on multilingual datasets.</p>
      <p>The experiments showed that even for low-resource languages, the models can achieve quite good
performance metrics. The results indicated that the ConvNext tiny model performed better than the
ConvNext small model. The study also found that Sub-center ArcFace loss showed more robustness to
new data while maintaining good performance metrics. Furthermore, the study showed that transferring
skills from other languages to low-resource languages was quite effective in achieving better
performance metrics. Finally, the study performed speaker diarization on the dataset using the best
model from previous experiments, achieving good results.</p>
      <p>In conclusion, the study demonstrated the effectiveness of deep learning models in the task of
speaker verification, even for low-resource languages. The study provides insights into the
best-performing network architectures and loss functions for this task and shows the potential for transferring
skills from other languages to low-resource languages. The findings of this study could have significant
implications for developing better speaker verification systems.</p>
      <p>Future work includes a comparison of a larger set of convolutional neural network architectures (especially ones with a large number of parameters), different loss functions and their combinations. It is also important to study transfer learning from other languages and to perform multilingual speaker verification.</p>
    </sec>
    <sec id="sec-11">
      <title>Appendix A</title>
      <p>Figure A.1: Change of metric during the first experiment: a) mean positive similarity; b) loss.</p>
      <p>Figure A.2: Change of metric during the first experiment: a) precision at top 3; b) precision at top 3.</p>
      <p>Figure A.3: Change of metric during the first experiment: a) negative standard deviation; b) mean negative similarity.</p>
      <p>Figure A.6: Change of metric during the second experiment: a) precision at top 5; b) negative standard deviation.</p>
      <p>Figure A.7: Change of metric during the second experiment: a) mean negative similarity; b) precision at top 10.</p>
      <p>Figure A.9: Change of metric during the third experiment: a) equal error rate; b) mean positive similarity.</p>
      <p>Figure A.12: Change of positive standard deviation during the third experiment.</p>
      <p>Figure A.15: Change of metric during the fourth and fifth experiments: a) mean negative similarity; b) precision at top 10.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Erdem</surname>
          </string-name>
          et al.,
          <article-title>'Neural Natural Language Generation: A Survey on Multilinguality, Multimodality, Controllability and Learning'</article-title>
          , 06-Apr-
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zevallos</surname>
          </string-name>
          , '
          <article-title>Text-To-Speech Data Augmentation for Low Resource Speech Recognition'</article-title>
          . arXiv,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Gelas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Pellegrino</surname>
          </string-name>
          , '
          <article-title>Developments of Swahili resources for an automatic speech recognition system'</article-title>
          ,
          <source>in Workshop on Spoken Language Technologies for Under-resourced Languages</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brummer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mccree</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garcia-Romero</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Vaquero</surname>
          </string-name>
          , '
          <article-title>Unsupervised Domain Adaptation for I-Vector Speaker Recognition'</article-title>
          ,
          <source>in Proc. The Speaker and Language Recognition Workshop (Odyssey 2014)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>260</fpage>
          -
          <lpage>264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ardila</surname>
          </string-name>
          et al.,
          <article-title>'Common Voice: A Massively-Multilingual Speech Corpus'</article-title>
          ,
          <source>in Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>4211</fpage>
          -
          <lpage>4215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Feichtenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Xie</surname>
          </string-name>
          , '
          <article-title>A ConvNet for the 2020s'</article-title>
          ,
          <source>in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>11966</fpage>
          -
          <lpage>11976</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          et al.,
          <article-title>'DOLG: Single-Stage Image Retrieval with Deep Orthogonal Fusion of Local and Global Features'</article-title>
          ,
          <source>2021 IEEE/CVF International Conference on Computer Vision (ICCV)</source>
          , pp.
          <fpage>11752</fpage>
          -
          <lpage>11761</lpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Hoffer</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Ailon</surname>
          </string-name>
          , '
          <article-title>Deep Metric Learning Using Triplet Network'</article-title>
          ,
          <source>in Similarity-Based Pattern Recognition</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>84</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xue</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Zafeiriou</surname>
          </string-name>
          , '
          <article-title>ArcFace: Additive Angular Margin Loss for Deep Face Recognition'</article-title>
          ,
          <source>in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4685</fpage>
          -
          <lpage>4694</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Zafeiriou</surname>
          </string-name>
          , '
          <article-title>Sub-center ArcFace: Boosting Face Recognition by Large-Scale Noisy Web Faces'</article-title>
          , in
          <source>Computer Vision -- ECCV 2020</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>741</fpage>
          -
          <lpage>757</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Musgrave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.-N.</given-names>
            <surname>Lim</surname>
          </string-name>
          , '
          <article-title>PyTorch Metric Learning'</article-title>
          . arXiv,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          , '
          <article-title>AdaCos: Adaptively Scaling Cosine Logits for Effectively Learning Deep Face Representations'</article-title>
          ,
          <source>2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , pp.
          <fpage>10815</fpage>
          -
          <lpage>10824</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Yerokhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Babii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Nechyporenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. P.</given-names>
            <surname>Turuta</surname>
          </string-name>
          ,
          <article-title>'A LARS-Based Method of the Construction of a Fuzzy Regression Model for the Selection of Significant Features'</article-title>
          ,
          <source>Cybernetics and Systems Analysis</source>
          , Vol.
          <volume>52</volume>
          , Issue 4, (
          <year>2016</year>
          ),
          <fpage>641</fpage>
          -
          <lpage>646</lpage>
          . https://doi.org/10.1007/s10559-016-9867-5
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>