<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maksym Kizitskyi</string-name>
          <email>maksym.kizitskyi@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olena Turuta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleksii Turuta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>1</label>
          <institution>Kharkiv National University of Radio Electronics</institution>
          ,
          <addr-line>14 Nauky Ave., Kharkiv, 61166</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Speaker verification</institution>
          ,
          <addr-line>ConvNext, DOLG Architecture, Multilingual Training</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Speaker verification is an essential task in speech processing. In this work the task is implemented using convolutional neural networks. Several key metrics were evaluated, including equal error rate and precision top-K, and the performance of different architectures and loss functions was compared. The experiments are conducted using a Ukrainian dataset and include comparisons of models trained on multilingual data, as well as models trained on clean and augmented data. The results are presented in tables and figures, showing that even for low-resource languages the models can achieve good performance metrics. The authors also discuss the implications of their findings and the potential for transferring skills to other languages. The paper provides valuable insights for researchers working in the field of speaker verification.</p>
      </abstract>
      <kwd-group>
        <kwd>Speaker verification</kwd>
        <kwd>ConvNext</kwd>
        <kwd>DOLG Architecture</kwd>
        <kwd>Multilingual Training</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In today's digital age, speech recognition and speaker verification techniques have become
increasingly important for a variety of applications. These technologies have revolutionized the way we
interact with machines, allowing for seamless communication and automation in various fields, from
personal assistants to security systems. Speech recognition refers to the ability of machines to identify
and transcribe human speech, while speaker verification focuses on verifying the identity of the person
speaking. Both technologies have numerous practical applications, including improving accessibility
for individuals with disabilities, enhancing the user experience of devices and applications, and
enhancing security measures in industries such as banking and finance. Thus, understanding the
importance and potential of speech recognition and speaker verification is crucial for those interested
in the future of technology and its impact on society.</p>
      <p>
        Despite the vast potential of speech recognition and speaker verification technologies, there are still
significant challenges in implementing them for low-resource languages [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] like Ukrainian. Many of
these languages lack the necessary data and resources to develop robust and accurate models. However,
recent advancements in machine learning, particularly in deep learning techniques, have made it
possible to overcome some of these limitations and enable the development of speech and speaker
recognition models for these languages.
      </p>
      <p>The potential impact of these technologies on low-resource languages is immense. Speech
recognition can greatly improve accessibility for individuals who speak these languages, allowing them
to communicate more effectively with technology and access a wider range of digital content. Speaker
verification can also enhance security measures in industries like finance and government, enabling
secure authentication of individuals who speak these languages.</p>
      <p>Furthermore, the development of speech recognition and speaker verification models for low-resource languages can have broader socio-economic benefits. For example, it can improve the efficiency and accuracy of customer service for businesses operating in these regions, increasing customer satisfaction and loyalty. It can also facilitate the development of new tools and applications that are specifically tailored to the needs of these populations, enhancing their digital literacy and participation in the global digital economy.</p>
      <p>The aim of the work is to develop an approach that performs highly accurate speaker verification (comparable with the performance of SOTA models for resource-rich languages) for low-resource languages like Ukrainian.</p>
      <p>So the goals of this work are to:
• Develop a robust speaker verification system
• Study the effectiveness of transferring speaker verification skills from other languages
• Compare the effectiveness of different approaches and algorithms</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>Speaker verification is the process of verifying the identity of a person based on their voice. This process is often used in security systems, access control and other applications where identification is required. However, speaker verification systems are typically designed for high-resource languages, leaving low-resource languages with limited options. In this literature review we explore the state-of-the-art research on speaker verification for low-resource languages.</p>
      <p>In recent years, researchers have attempted to address the issue of speaker verification for low
resource languages by developing systems that are capable of identifying individuals who speak less
common languages. These efforts have been driven by the need to ensure that all people, regardless of
their language, can have access to secure and reliable identification systems.</p>
      <p>
        One approach that has been used to overcome the lack of resources for low resource languages is
data augmentation. This technique involves creating new data from existing data by applying various
transformations such as pitch shifting, noise addition and speed variation. In a study by Chen et al.
(2021) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the authors proposed a data augmentation method for speaker verification in low resource
languages using a combination of noise addition, reverberation and pitch shifting. The authors reported
that their proposed method outperformed the baseline approach, which only used the original data.
      </p>
      <p>
        Another approach that has been explored is transfer learning, which involves training a model on a
resource-rich language and then fine-tuning it for a low resource language. In a study by Sigtia et al.
(2018) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the authors proposed a transfer learning method for speech recognition in Swahili, a low
resource language spoken in East Africa. The authors trained a deep neural network (DNN) on a large
dataset of English speech and then fine-tuned the model on a smaller dataset of Swahili speech. The
authors reported that their proposed method outperformed the baseline approach, which only used the
small dataset of Swahili speech.
      </p>
      <p>
        In addition to data augmentation and transfer learning, other approaches have also been explored,
such as unsupervised speaker adaptation and speaker diarization. Unsupervised speaker adaptation
involves adapting a pre-trained model to a new speaker without requiring any labeled data. In a study
by Gautam et al. (2019) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the authors proposed an unsupervised speaker adaptation method for
speaker verification in Hindi, a low resource language spoken in India. The authors reported that their
proposed method outperformed the baseline approach, which required labeled data.
      </p>
      <p>In conclusion, the research on speaker verification for low resource languages is an emerging area
of study, and several approaches have been proposed to address this issue. Data augmentation, transfer
learning, unsupervised speaker adaptation and speaker diarization are some of the approaches that have
been explored. While these approaches have shown promise, there is still much work to be done to
develop accurate and reliable speaker verification systems for low resource languages.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods and materials</title>
      <p>This section describes the data used in the following experiments, as well as the other materials and methods proposed to solve the problem under consideration.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1. Dataset Description</title>
      <p>
        As a base dataset we have chosen the Common Voice dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. It is a crowd-sourced dataset that contains a large number of audio recordings from different speakers, even for low-resource languages like Ukrainian. A large number of speakers is essential for building a robust speaker verification system. The Ukrainian dataset contains 73 hours of recordings of 120 speakers in the train split and 14 hours of 639 speakers in the test split.
      </p>
      <p>In some experiments we additionally use datasets in other languages (language 1 and language 2) in the training process. The duration and number of unique speakers are presented in Table 1.</p>
      <p>In order not to overfit on speakers with a small number of recordings, we dropped speakers with fewer than 40 recordings from the training dataset. Figure 1 shows the histograms of the number of recordings per speaker.</p>
      <p>Also, in some experiments we limit the number of recordings per speaker for the Ukrainian recordings in order to make the dataset balanced.</p>
      <p>As a feature extraction step, we split each audio into 3-second chunks and extracted a spectrogram from each of them. After that we normalized the spectrograms, and from this step on we could process them like images.</p>
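      <p>A minimal sketch of this preprocessing step, assuming torchaudio; the file name and the mel parameters (n_fft, n_mels) are illustrative assumptions rather than the exact values of our pipeline.</p>
      <preformat>
import torchaudio

# Load a recording and down-mix to mono; the file name is hypothetical.
wav, sr = torchaudio.load("speaker_0001.wav")
wav = wav.mean(dim=0, keepdim=True)

# Split into 3-second chunks, as described above.
chunk_len = 3 * sr
usable = wav.shape[1] // chunk_len * chunk_len
chunks = wav[:, :usable].split(chunk_len, dim=1)

# Mel spectrogram extraction; n_fft and n_mels are assumptions.
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_fft=1024, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB()

specs = []
for c in chunks:
    s = to_db(to_mel(c))                   # shape (1, n_mels, frames), image-like
    s = (s - s.mean()) / (s.std() + 1e-6)  # per-chunk normalization
    specs.append(s)
      </preformat>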
      <p>During the training process in some experiments, we applied mel spectrogram augmentations such as time and frequency masking. This was done in order to prevent overfitting and make the model robust to real-world data.</p>
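      <p>A hedged sketch of these augmentations using the corresponding torchaudio transforms; the mask widths are illustrative assumptions, not our exact settings.</p>
      <preformat>
import torchaudio

freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=16)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=32)

def augment(spec):
    # Zero out a random band of mel bins and a random span of time frames;
    # applied only during training.
    return time_mask(freq_mask(spec))
      </preformat>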
      <p>In order to evaluate the model on real-world data, we additionally collected recordings of interviews, department meetings in Google Meet, etc.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2. Methods</title>
      <p>We have chosen the following key metrics:
1) Equal error rate (EER) – one of the most widely used metrics for evaluating speaker verification models; the error rate at the operating point where the false acceptance rate equals the false rejection rate.
2) Precision Top-K – the fraction of examples among the Top-K data points most similar to the original one that belong to the same class (a sketch of both metrics is given after this list). In our experiments we used K equal to 3, 5 and 10.
3) Mean and standard deviation of positive similarity – mean and standard deviation of cosine similarity between examples of the same class.
4) Mean and standard deviation of negative similarity – mean and standard deviation of cosine similarity between examples that do not belong to the same class as the original data point.</p>
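      <p>A minimal sketch of the first two metrics, assuming cosine-similarity scores; scikit-learn's ROC curve is used to locate the equal error rate.</p>
      <preformat>
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    # labels: 1 for same-speaker trials, 0 for different-speaker trials
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))    # operating point where FAR = FRR
    return (fpr[i] + fnr[i]) / 2

def precision_at_k(embeddings, speaker_ids, k=3):
    # Fraction of the k most similar data points that share the query's
    # speaker id, averaged over all queries.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)         # exclude the query itself
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = speaker_ids[topk] == speaker_ids[:, None]
    return hits.mean()
      </preformat>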
    </sec>
    <sec id="sec-6">
      <title>4. Experiment</title>
      <p>
        We have chosen ConvNext [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] as a backbone because it is one of the best-performing convolutional neural network architectures in computer vision tasks such as ImageNet classification. We used randomly initialized weights, because mel spectrograms are completely different from the datasets the network was pre-trained on, so it is unlikely that pre-training would give an advantage in the task of speaker verification. Because of limitations in computational resources we only tried the small and tiny versions of it.
      </p>
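      <p>A minimal sketch, not our exact training code, of turning a randomly initialized ConvNext tiny into an embedding network; the single-channel stem and the 512-dimensional embedding size are illustrative assumptions.</p>
      <preformat>
import torch
import torch.nn as nn
from torchvision.models import convnext_tiny

model = convnext_tiny(weights=None)        # randomly initialized, as described
# Accept 1-channel mel spectrograms instead of 3-channel images.
model.features[0][0] = nn.Conv2d(1, 96, kernel_size=4, stride=4)
# Replace the classification head with an embedding head.
model.classifier[2] = nn.Linear(768, 512)

spec = torch.randn(8, 1, 128, 188)         # batch of normalized mel spectrograms
emb = nn.functional.normalize(model(spec), dim=1)  # L2-normalized embeddings
      </preformat>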
      <p>
        In order to improve the model we experimented with the DOLG architecture [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which showed SOTA results in face recognition and image retrieval. It originally used ResNet as a backbone, so we had to adapt it to our task with ConvNext as the backbone.
      </p>
      <p>The experiments are organized as follows:
1) Compare the performance of the ConvNext tiny and ConvNext small backbones.
2) Compare the performance of the best backbone from the previous experiment trained with different loss functions: Triplet loss [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], ArcFace loss [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Sub-center ArcFace loss [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], ArcFace loss + Triplet loss, Sub-center ArcFace loss + Triplet loss (a sketch of these losses is given after this list).
3) Compare the performance of the best architecture from the previous experiments when training on a large dataset that includes other languages. Validation is performed only on the Ukrainian dataset. This experiment helps to determine the possibility of transferring skills from other languages in the task of speaker verification. Since the size of the dataset is increased, the network is trained for only 4 epochs.
4) Compare the performance of the best model from the previous experiment with the same model as a backbone in the DOLG architecture. This experiment helps to identify the possibility of applying the DOLG architecture to the task of speaker verification.
5) Compare the performance of the model from the previous experiments trained on clean data and on augmented data. This experiment aims to determine how the usage of augmented data affects the training process.
      </p>
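      <p>
        A minimal sketch of these objectives, built with the PyTorch Metric Learning library [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]; the number of classes, the embedding size and the equal weighting of the combined variant are illustrative assumptions rather than our exact training settings.
      </p>
      <preformat>
from pytorch_metric_learning import losses

num_classes, emb_size = 120, 512   # e.g. the 120 Ukrainian train speakers
arcface = losses.ArcFaceLoss(num_classes=num_classes, embedding_size=emb_size)
subcenter = losses.SubCenterArcFaceLoss(num_classes=num_classes, embedding_size=emb_size)
triplet = losses.TripletMarginLoss(margin=0.2)

def subcenter_plus_triplet(embeddings, labels):
    # One of the combined variants; ArcFace-style losses keep trainable
    # class weights, so their parameters must also be passed to the optimizer.
    return subcenter(embeddings, labels) + triplet(embeddings, labels)
      </preformat>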
      <p>After all of these experiments we performed speaker diarization on our dataset using the best model from the previous experiments. To achieve this, we split the audio into parts of 3 seconds, transformed them into mel spectrograms and obtained embeddings with our model. Then the embeddings were clustered using the KMeans algorithm.</p>
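      <p>A minimal sketch of this diarization step; chunk_embeddings stands for the per-window embeddings produced by the model, and the number of speakers is assumed to be known in advance.</p>
      <preformat>
import numpy as np
from sklearn.cluster import KMeans

def diarize(chunk_embeddings, n_speakers=2):
    # Each predicted cluster id is treated as the speaker of one
    # 3-second window.
    km = KMeans(n_clusters=n_speakers, n_init=10, random_state=0)
    return km.fit_predict(chunk_embeddings)
      </preformat>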
      <p>Training was carried out in the Kaggle environment using a P100 GPU.</p>
    </sec>
    <sec id="sec-7">
      <title>5. Results</title>
    </sec>
    <sec id="sec-8">
      <title>5.1. ML Results</title>
      <p>The results of the experiments are shown in Figures 4–6 and in Table 2. All the graphs are shown in Appendix A.</p>
      <p>Figure 6 shows the change of loss and precision at top 3 during the fourth and fifth experiments. The DOLG architecture shows better initial performance and better performance in general. Also, data augmentations slightly improved the model's performance and robustness to new data.</p>
      <p>Table 2: Evaluation metrics for the compared configurations, including convnext_tiny sub center, convnext_small and convnext_tiny arcface (the first column appears to be the final loss value).</p>
      <preformat>
loss       mean_neg  mean_pos  std_neg   std_pos   eer_mean
4.779827   0.006445  0.512067  0.06615   0.191957  0.05745
2.266273   0.143715  0.621793  0.136854  0.188838  0.120028
1.056141   0.098119  0.543737  0.112699  0.226304  0.130431
0.819421   0.012565  0.559775  0.122927  0.183735  0.068941
20.94974   0.064958  0.634587  0.18114   0.201048  0.116371
12.71726   0.049018  0.682394  0.205036  0.192486  0.105805
      </preformat>
      <p>In order to test model performance on real-world data, we performed speaker diarization of a Google Meet call between 2 speakers. First of all, we split the audio into windows of 3 seconds each. Next we transformed the raw audio into mel spectrograms and extracted embeddings using our model. These embeddings were clustered using the KMeans algorithm. The results of clustering are shown in Figure 7.</p>
      <p>As we can see from the plots, there are 2 large clusters which represent the speakers. The boundary region between the clusters represents fragments where both speakers are active.</p>
      <p>As a next step, each embedding was matched with the corresponding timestamp. The result was formatted according to the SRT format and is shown in Figure 8.</p>
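      <p>A hedged sketch of this formatting step; the 3-second window length matches our pipeline, while the to_srt helper and the speaker naming are illustrative assumptions.</p>
      <preformat>
def to_srt(labels, window_s=3):
    # Format per-window cluster labels as SRT subtitle entries.
    def ts(sec):
        return f"{sec // 3600:02d}:{sec % 3600 // 60:02d}:{sec % 60:02d},000"
    entries = []
    for i, lab in enumerate(labels):
        start, end = i * window_s, (i + 1) * window_s
        entries.append(f"{i + 1}\n{ts(start)} --> {ts(end)}\nSpeaker {lab}\n")
    return "\n".join(entries)
      </preformat>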
      <p>In conclusion, the model trained for speaker verification showed good results in the task of speaker diarization on real-world data.</p>
    </sec>
    <sec id="sec-9">
      <title>6. Discussions</title>
      <p>As a result of the first experiment it was shown that even for low-resource languages the models can achieve quite good performance metrics. The results of both ConvNext tiny and ConvNext small are quite similar. For both networks we can see that after the 6th epoch the precision at N starts to decrease or stays approximately the same, which may indicate overfitting of the networks. Also, after the 6th epoch the negative standard deviation reached a plateau and no longer decreased as fast as before. On the other hand, the standard deviation of positive examples constantly increased over training. We decided to use ConvNext tiny because it has fewer parameters, so the following experiments could be performed faster. The question of the performance of larger networks (like base or large) remains open; they may perform better in the task of speaker verification.</p>
      <p>In the second experiment we compared different loss functions. In the end, all networks performed approximately the same, but the losses that contain triplet loss performed a bit worse than ArcFace and Sub-center ArcFace: they reached a plateau faster and converged more slowly. In general, all the metrics follow the same trend as in the previous experiment. We have chosen Sub-center ArcFace because it shows more robustness to new data while keeping good performance metrics.</p>
      <p>In the third experiment we compared the model trained on only one language with models trained on a multilingual dataset. The multilingual models show superior metrics on the Ukrainian test set and achieve better results in general. However, the model trained on the full multilingual dataset reached a plateau faster than the one trained on the balanced dataset (where the number of recordings per speaker is approximately the same as in the target language), which may indicate overfitting to languages with more speakers. So transferring skills from other languages to low-resource languages like Ukrainian is quite effective, but in order to achieve better results the dataset should be balanced.</p>
      <p>In the fourth experiment we compared ConvNext with the DOLG pipeline using ConvNext as a backbone on the balanced multilingual dataset. DOLG shows superior results and comes close to SOTA results in the task of speaker verification. In addition, it was trained for only 6 epochs, so it may be possible to achieve better results with further training.</p>
      <p>In the fifth experiment we applied augmentations to the spectrograms and repeated the previous experiment. As a result, the model achieved an even better level of performance and robustness.</p>
      <p>Next we used the model from the last experiment to analyze real-world data: a Google Meet call between 2 people. It performed quite well; however, when there was no sound, the model could treat the silence as a separate speaker. In conclusion, we recommend using ConvNext tiny as a backbone in the DOLG pipeline to achieve SOTA results.</p>
    </sec>
    <sec id="sec-10">
      <title>7. Conclusions</title>
      <p>This paper presents a study on speaker verification using deep learning models. The study used four key metrics to evaluate the performance of the models: equal error rate, precision top-K, mean and standard deviation of positive similarity, and mean and standard deviation of negative similarity. The study compared the performance of different network architectures, such as ConvNext and DOLG, and different loss functions, such as Triplet loss, ArcFace, and Sub-center ArcFace. The study also compared the performance of models trained on single-language datasets and those trained on multilingual datasets.</p>
      <p>The experiments showed that even for low-resource languages, the models can achieve quite good
performance metrics. The results indicated that the ConvNext tiny model performed better than the
ConvNext small model. The study also found that Sub-center ArcFace loss showed more robustness to
new data while maintaining good performance metrics. Furthermore, the study showed that transferring
skills from other languages to low-resource languages was quite effective in achieving better
performance metrics. Finally, the study performed speaker diarization on the dataset using the best
model from previous experiments, achieving good results.</p>
      <p>In conclusion, the study demonstrated the effectiveness of deep learning models in the task of
speaker verification, even for low-resource languages. The study provides insights into the
best-performing network architectures and loss functions for this task and shows the potential for transferring
skills from other languages to low-resource languages. The findings of this study could have significant
implications for developing better speaker verification systems.</p>
      <p>Future work includes a comparison of a larger set of convolutional neural network architectures (especially ones with a large number of parameters), different loss functions and their combinations. It is also important to study transfer learning from other languages and to perform multilingual speaker verification.</p>
    </sec>
    <sec id="sec-11">
      <title>Appendix A</title>
      <p>Figure A.1: Change of metric during the first experiment: a) mean positive similarity; b) loss.</p>
      <p>Figure A.2: Change of metric during the first experiment: a) precision at top 3; b) precision at top 3.</p>
      <p>Figure A.3: Change of metric during the first experiment: a) negative standard deviation; b) mean negative similarity.</p>
      <p>Figure A.6: Change of metric during the second experiment: a) precision at top 5; b) negative standard deviation.</p>
      <p>Figure A.7: Change of metric during the second experiment: a) mean negative similarity; b) precision at top 10.</p>
      <p>Figure A.9: Change of metric during the third experiment: a) equal error rate; b) mean positive similarity.</p>
      <p>Figure A.12: Change of positive standard deviation during the third experiment.</p>
      <p>Figure A.15: Change of metric during the fourth and fifth experiments: a) mean negative similarity; b) precision at top 10.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Erdem</surname>
          </string-name>
          et al.,
          <article-title>'Neural Natural Language Generation: A Survey on Multilinguality, Multimodality, Controllability and Learning'</article-title>
          , 06-Apr-
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zevallos</surname>
          </string-name>
          , '
          <article-title>Text-To-Speech Data Augmentation for Low Resource Speech Recognition'</article-title>
          . arXiv,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Gelas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Pellegrino</surname>
          </string-name>
          , '
          <article-title>Developments of Swahili resources for an automatic speech recognition system'</article-title>
          ,
          <source>in Workshop on Spoken Language Technologies for Under-resourced Languages</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brummer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mccree</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garcia-Romero</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Vaquero</surname>
          </string-name>
          , '
          <article-title>Unsupervised Domain Adaptation for I-Vector Speaker Recognition'</article-title>
          ,
          <source>in Proc. The Speaker and Language Recognition Workshop (Odyssey 2014)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>260</fpage>
          -
          <lpage>264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ardila</surname>
          </string-name>
          et al.,
          <article-title>'Common Voice: A Massively-Multilingual Speech Corpus'</article-title>
          ,
          <source>in Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>4211</fpage>
          -
          <lpage>4215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Feichtenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Xie</surname>
          </string-name>
          , '
          <article-title>A ConvNet for the 2020s'</article-title>
          ,
          <source>in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>11966</fpage>
          -
          <lpage>11976</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          et al.,
          <article-title>'DOLG: Single-Stage Image Retrieval with Deep Orthogonal Fusion of Local and Global Features'</article-title>
          ,
          <source>2021 IEEE/CVF International Conference on Computer Vision (ICCV)</source>
          , pp.
          <fpage>11752</fpage>
          -
          <lpage>11761</lpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Hoffer</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Ailon</surname>
          </string-name>
          , '
          <article-title>Deep Metric Learning Using Triplet Network'</article-title>
          ,
          <source>in Similarity-Based Pattern Recognition</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>84</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xue</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Zafeiriou</surname>
          </string-name>
          , '
          <article-title>ArcFace: Additive Angular Margin Loss for Deep Face Recognition'</article-title>
          ,
          <source>in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4685</fpage>
          -
          <lpage>4694</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Zafeiriou</surname>
          </string-name>
          , '
          <article-title>Sub-center ArcFace: Boosting Face Recognition by Large-Scale Noisy Web Faces'</article-title>
          , in
          <source>Computer Vision -- ECCV 2020</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>741</fpage>
          -
          <lpage>757</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Musgrave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.-N.</given-names>
            <surname>Lim</surname>
          </string-name>
          , '
          <article-title>PyTorch Metric Learning'</article-title>
          . arXiv,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          , '
          <article-title>AdaCos: Adaptively Scaling Cosine Logits for Effectively Learning Deep Face Representations'</article-title>
          ,
          <source>2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , pp.
          <fpage>10815</fpage>
          -
          <lpage>10824</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Yerokhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Babii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Nechyporenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. P.</given-names>
            <surname>Turuta</surname>
          </string-name>
          ,
          <article-title>'A LARS-Based Method of the Construction of a Fuzzy Regression Model for the Selection of Significant Features'</article-title>
          ,
          <source>Cybernetics and Systems Analysis</source>
          , Vol.
          <volume>52</volume>
          , Issue 4, (
          <year>2016</year>
          ),
          <fpage>641</fpage>
          -
          <lpage>646</lpage>
          . https://doi.org/10.1007/s10559-016-9867-5
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>