<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An efficient Multilingual Speaker Recognition system using fusion technique</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mayur Rahul</string-name>
          <email>mayurrahul209@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sonu Kumar Jha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sarvachan Verma</string-name>
          <email>sarvachan.verma@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vikash Yadav</string-name>
          <email>vikas.yadav.cs@gmail.com</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Devendra Kumar Dellwar</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ajay Kumar Garg Engineering College</institution>
          ,
          <addr-line>Ghaziabad, Uttar Pradesh</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Application, UIET CSJM Kanpur Nagar</institution>
          ,
          <addr-line>Uttar Pradesh</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Government Polytechnic Bighapur Unnao, Department of Technical Education</institution>
          ,
          <addr-line>Uttar Pradesh</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Krishna Engineering College</institution>
          ,
          <addr-line>Ghaziabad, Uttar Pradesh</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>SR Group of Institutions</institution>
          ,
          <addr-line>Jhansi, Uttar Pradesh</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The robustness and performance of a speech-signal-based framework depend on the quality of its features. In the current era of research, a single feature might not be enough to cover both robustness and performance simultaneously. To resolve this problem, researchers use multiple sources of information by applying various fusion techniques. These fusion techniques fall into three categories: model-level, feature-level, and score-level combination schemes. Previous research shows that features obtained from different sources are used to enhance the strength and recognition rate of a system. Although these fusion techniques improve accuracy, they also introduce drawbacks, which motivates further investigation. The aim of this work is to introduce a multilingual speaker recognition system based on an SVM with a fusion technique. The objective is to explore the advantages of various fusion techniques and how they help to build an efficient multilingual speaker recognition system. The results of our proposed system indicate the effectiveness of our approach.</p>
      </abstract>
      <kwd-group>
        <kwd>Multilingual speaker recognition</kwd>
        <kwd>SVM</kwd>
        <kwd>fusion techniques</kwd>
<kwd>model level</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Speech processing can be classified into two categories: speech recognition and speaker
recognition. These systems extract important information from speech signals
and identify the required results by machine. In speaker recognition, the
machine tries to retrieve speaker-specific information from the given speech signals,
whereas in speech recognition only textual information is extracted. Both are
similar to pattern recognition systems. The accuracy of such a system depends on the
discriminating power of the features used, and feature extraction generally
depends on the type of task. For speaker recognition, the machine computes linear
prediction cepstral coefficients (LPCC) or mel-frequency cepstral coefficients (MFCC),
which represent speaker-specific vocal information in precise form [1, 2, 3].</p>
      <p>0000-0002-2394-865X (M. Rahul); 0009-0009-4378-7302 (S. K. Jha); 0009-0003-7588-9449 (S. Verma);
0000-0003-1348-1379 (V. Yadav); 0009-0007-8928-2321 (D. K. Dellwar)
© 2023 Copyright for this paper by its authors.</p>
      <p>Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>CEUR Workshop Proceedings (CEUR-WS.org)</p>
      <p>
        Researchers also explore speaker-specific information as complementary evidence using various
fusion methods. These methods provide better performance than independent
vocal-tract-based systems and are comparably robust under various
conditions [
        <xref ref-type="bibr" rid="ref1">4</xref>
        ]. The MFCC characteristics retrieved from phoneme samples are used as the main
features for speech recognition systems. MFCC features describe the
spectral envelope of the various phonemes used in speech recognition.
Since speech recognition is a speaker-independent procedure, it needs a huge
amount of data to represent phoneme-based information efficiently. To overcome
these complications, researchers use multiple sources of information. Tripathi et al. proposed different kinds of
source information and incorporated them with MFCC features using suitable fusion
methods [
        <xref ref-type="bibr" rid="ref2">5</xref>
        ]. They also showed that combining source information with MFCC
features not only enhances the accuracy but also improves the robustness of the phoneme
recognition process.
      </p>
      <p>
        Excitation source information is commonly used as additional evidence alongside vocal
tract information to enhance various speech recognition systems [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">5, 6, 7</xref>
        ].
There are two reasons for using excitation source information as additional evidence.
First, people use excitation features such as duration, intonation, and pitch to identify speakers
as well as the content of the speech data [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8">8, 9, 10, 11</xref>
        ]. These features have proven
effective even in degraded conditions, demonstrating the capability of excitation source
data [
        <xref ref-type="bibr" rid="ref9">12</xref>
        ]. The second reason is the complementary nature of source and vocal tract information:
this complementarity provides additional evidence that can enhance the performance
and robustness of a baseline framework. Researchers have also observed that combining
excitation source and vocal tract information enhances the robustness and performance of
speaker and speech recognition frameworks [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">13, 14, 15</xref>
        ].
The performance of such a combined system depends on the relevance of the features as well
as on a suitable fusion method; the optimal benefit is obtained by applying a
suitable fusion to effective features. Since excitation source and MFCC information are
paramount for various speech processing frameworks [
        <xref ref-type="bibr" rid="ref14 ref15 ref16">17, 18, 19</xref>
        ], achieving optimal
performance depends mostly on the applied fusion technique. Features can be fused
at the score, model, or feature level, as illustrated by the block diagram of speech
recognition shown in figure 1. In the pre-processing step, the speech sample is prepared as
input for feature extraction, whose purpose is to compute the required features using various
signal processing techniques. In feature-level fusion, multiple features are computed and fused
before model creation; the same procedure is followed to compute the test features used for
matching. In model-level fusion, separate models are created from individual feature
sets, and the parameters of the different models are then combined into a composite model;
finally, the test speech specimen is compared against the composite model. In score-level
fusion, different features are obtained from the voice signal and used to create
corresponding models. During matching, each feature set is matched with its
corresponding model to produce an individual score, and these scores are combined into the final
score.
      </p>
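      <p>The score-level scheme described above can be sketched as follows. This is our own illustrative example rather than the paper's implementation: the feature streams, score values, and the 0.6 weight are placeholder assumptions, and a weighted sum stands in for whatever combination rule a real system would use.</p>
      <preformat>
```python
# Illustrative score-level fusion: two feature streams are modeled
# separately, and their per-speaker match scores are combined into a
# single decision score by a weighted sum.

def score_level_fusion(scores_a, scores_b, weight_a=0.5):
    """Combine per-model scores from two feature streams."""
    weight_b = 1.0 - weight_a
    return [weight_a * a + weight_b * b for a, b in zip(scores_a, scores_b)]

# Placeholder scores from an MFCC-based model and a residual-feature model
# for three enrolled speakers; the highest fused score picks the speaker.
mfcc_scores = [0.62, 0.91, 0.40]
residual_scores = [0.55, 0.80, 0.35]
fused = score_level_fusion(mfcc_scores, residual_scores, weight_a=0.6)
best_speaker = max(range(len(fused)), key=lambda i: fused[i])
```
      </preformat>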
      <p>In a speech recognition system, features represent the task-relevant information
in a precise form and are used as building blocks for the various
classes or patterns: phoneme models in automatic speech recognition and speaker
models in automatic speaker recognition. Existing work shows that, instead of using a
single feature, fusing multiple features gives better class models for speech-based
pattern recognition tasks. Moreover, fusing various features enhances not only the
robustness but also the performance of these systems. For example, recent research has shown
the benefits of multiple features for speech recognition, automatic speaker
recognition, and replay identification systems. In this research, the fusion-based
techniques are limited to combining features at a single level, and each technique has its
own advantages and disadvantages. A combined fusion technique could be created by
utilizing the advantages of the individual combination schemes, which can be effective, efficient,
and useful for different speech processing systems. Our target is to exploit the advantages
of different speaker recognition systems and apply them to build an effective
recognition scheme. The main contributions of this work are as follows:
(1) The paper presents a literature review of various types of speaker identification
systems with their historical background.
(2) The paper summarizes the feature extraction, datasets, accuracy, and demerits of
existing work.
(3) The paper introduces SVM-based multilingual speaker recognition using MPDSS,
RMFCC, and MFCC features.
(4) The paper evaluates the combination of MPDSS, RMFCC, and MFCC on the TIMIT and
NIST 2003 SRE datasets.
(5) The performance of our approach is the best when compared with other existing work.</p>
      <p>The remainder of the paper is organized as follows. Section II explains the related
work. Section III presents the research methodology used for multilingual speaker
recognition. The experiments and results are presented in Section IV. Finally, Section V
concludes the paper and outlines future work.</p>
    </sec>
    <sec id="sec-2">
<title>2. Related works</title>
      <p>A speech recognition system extracts important information from a speech
signal using different signal processing techniques for some application. A person's speech
effectively reflects both the textual content and the speaker's identity. Speech
processing systems are generally divided into two categories: speech recognition and
speaker recognition. Extracting the textual data present in speech is called speech
recognition, while using the speaker data to identify the speaker is called speaker
recognition. We consider systems from these two fields as benchmarks to
demonstrate the robustness of the proposed method. A detailed explanation of speech and
speaker recognition is given in this section.</p>
      <p>The method of identifying people by machine using the data available in speech samples is
called speaker recognition (SRS). SRS is broadly divided into two categories:
automatic speaker verification systems (SVS) and speaker identification systems (SIS). In SIS,
the objective of the machine is to detect the speaker from the given test samples, whereas in
SVS the objective is to verify a claimed identity with the help of the given speech samples.</p>
      <p>
        The entire SRS process consists of two parts: training and testing [
        <xref ref-type="bibr" rid="ref17">20</xref>
        ]. In the training step, the
machine gathers speech samples from each speaker and enrolls them using the SRS
technique. Training consists of feature extraction and model creation. Speaker-specific
information is retrieved in the feature extraction step from every sample using
various signal processing techniques and represented in parametric form. These
features are then used in the modeling stage to create a model. In the testing step, the machine
computes speaker-specific features from the test sample using the same feature extraction
technique as in training and compares them with the existing models. Depending on
the task, a comparison is performed in the matching step, which yields a matching
score that identifies the speaker of the speech sample.
      </p>
      <p>
        Existing systems predominantly use cepstral computation for feature
extraction and probabilistic techniques such as the Gaussian Mixture Model (GMM) [
        <xref ref-type="bibr" rid="ref18 ref19">21, 22</xref>
        ]. Based
on the given speech samples, SRS are classified into two categories: text-independent and
text-dependent. In the text-dependent case, the speakers under test are required to utter the
same speech sample as given at enrollment. There is no textual limitation in the
text-independent case, which makes it suitable for real-time use.
      </p>
      <p>The field of speaker recognition additionally includes two research areas: limited-data
speaker verification and replay attack identification. In comparison with a traditional
speaker verification system, limited-data speaker verification requires a small amount of
data for the testing and training processes. Because a smaller amount of data is used, limited-data
speaker verification is a very challenging task in the area of speaker recognition. A
replay attack is a kind of spoofing attack on automated speaker verification, where the
decision can be changed by pre-recorded speaker samples captured with recording and playback devices.
It does not require any technical knowledge; only a smartphone is needed for spoofing. Existing
reviews show that replay attacks are highly efficient, effective, and easily
accessible, and so constitute a critical threat to automated speaker verification.</p>
      <p>Speaker identification is a method of identifying speakers from their speech samples. A
set of known speakers is enrolled by the machine and used as reference patterns for
recognizing an unknown speaker. Speaker identification is performed in two
steps: training and testing. In the training step, speaker-specific features are
retrieved from the set of speakers and used to create the respective reference models; commonly,
excitation source and vocal tract information are used for creating the reference models. In the
testing step, the same speaker-specific features are extracted from the test speaker
samples and matched against all stored speaker models for recognition.</p>
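      <p>The two-step identification pipeline described above can be sketched as follows. This is an illustrative toy example, not the paper's implementation: it assumes scikit-learn's GaussianMixture as the modeling back end and uses random placeholder vectors in place of real speaker features.</p>
      <preformat>
```python
# GMM-based speaker identification sketch: one model per enrolled
# speaker during training; at test time the model with the highest
# average log-likelihood identifies the speaker.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Placeholder "feature frames" for two enrolled speakers (stand-ins for
# real MFCC or excitation-source features).
train = {
    "spk1": rng.normal(loc=0.0, scale=1.0, size=(200, 13)),
    "spk2": rng.normal(loc=3.0, scale=1.0, size=(200, 13)),
}

# Training step: fit one GMM per enrolled speaker.
models = {spk: GaussianMixture(n_components=4, random_state=0).fit(x)
          for spk, x in train.items()}

# Testing step: score a test utterance against every reference model.
test_utterance = rng.normal(loc=3.0, scale=1.0, size=(50, 13))
scores = {spk: m.score(test_utterance) for spk, m in models.items()}
identified = max(scores, key=scores.get)
```
      </preformat>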
      <p>Speaker verification is the method of matching an unknown applicant against a reference
model using the given speech samples. Clearly, the applicant must be registered with the
machine before placing a claim, so the applicant is first asked to provide speech samples
for registration. During verification, the applicant's voice samples are compared
with the corresponding reference model, and the decision is based purely on a threshold: if the
matching score is greater than the threshold the claim is accepted, otherwise it is rejected.</p>
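      <p>The threshold decision described above amounts to a one-line rule; the sketch below is a schematic illustration with an arbitrary threshold value, not a calibrated operating point.</p>
      <preformat>
```python
# Verification decision: accept the claimed identity only if the match
# score exceeds a preset threshold (0.7 here is a placeholder value).
def verify(match_score, threshold=0.7):
    return "accepted" if match_score > threshold else "rejected"
```
      </preformat>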
      <p>Limited-data speaker verification refers to a verification task where the
available testing and training data are very short, say less than 10 seconds. In forensic
investigations where data are scarce, performance is strongly affected, partly due to
inadequate coverage of the speech samples, so effective and efficient techniques are required
for these conditions.</p>
      <p>
        Automated speaker verification is generally applied without human supervision. In that
circumstance, it is possible that a fraudster may fool the system with fake speech
samples of a speaker. In the field of speaker recognition, defrauding an automated speaker
verification system by presenting fake speech samples is called spoofing. For speaker
verification, spoofing can be performed using four techniques: voice conversion,
speech synthesis, replay attack, and impersonation. Impersonation is the method where a
fraudster tries to generate the speech by voice mimicry [
        <xref ref-type="bibr" rid="ref20 ref21">23, 24</xref>
        ]. A replay attack is the method of
changing the decision of automated speaker verification with the help of pre-recorded speech
samples through playback and recording devices [
        <xref ref-type="bibr" rid="ref22 ref23">25, 26</xref>
        ]. Speech synthesis and voice
conversion require deep speech processing and signal processing knowledge, as well as
a large amount of data, to produce a synthesized voice. In comparison, spoofing through
record-and-replay does not need any speech processing knowledge; a replay attack can be
mounted simply with good-quality playback and recording devices. Existing research
reports on spoofing of automated speaker verification conclude that the replay attack is highly
effective and easily accessible.
      </p>
      <p>
        3. Proposed Methodology: In speech processing systems, the fusion of different features to
build efficient and effective models is called feature-level fusion. The
motivation behind feature-level fusion is that each individual feature stream contains some
important information that may be missed by the other models. In the feature-level
technique, different features are merged and then used for creating models [
        <xref ref-type="bibr" rid="ref24">27</xref>
        ]. The general block
diagram of a speech recognition system applying the feature-level scheme is
shown in figure 2. In the training step, the input voice signal passes through the pre-processing
step, and then various features are computed using various signal processing techniques.
These individual features are merged into a combined feature, which is further used for
creating the reference models. Note that the features being concatenated
are not required to have the same dimensions. At test time, the same method is
followed to create the composite features used for matching.
      </p>
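      <p>The feature-level concatenation described above can be sketched in a few lines. The frame count and dimensionalities below are illustrative placeholders (39-dimensional MFCC and 25-dimensional MPDSS, echoing the dimensions used later in the paper); random vectors stand in for real features.</p>
      <preformat>
```python
# Feature-level fusion sketch: per-frame feature vectors from different
# extractors are concatenated into one composite vector before model
# building. The streams need not have the same dimensionality.
import numpy as np

n_frames = 100
mfcc = np.random.rand(n_frames, 39)   # e.g. 13 MFCC + Delta + DeltaDelta
mpdss = np.random.rand(n_frames, 25)  # e.g. 25-dimensional MPDSS

# Composite per-frame feature used for model creation and matching.
fused = np.concatenate([mfcc, mpdss], axis=1)
```
      </preformat>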
      <p>
        The first research applying the feature-level fusion scheme was done by Furui et al.
[
        <xref ref-type="bibr" rid="ref25">28</xref>
        ], who proposed concatenating well-known cepstral features with first- and
second-order polynomial coefficients in the form of Delta and DeltaDelta coefficients, applied
to speaker recognition. Compared with cepstral features alone, the concatenation of
these features reduces the error rate by 30% [
        <xref ref-type="bibr" rid="ref26 ref27">29, 30</xref>
        ]. Further, fusion of different features
reduces the speaker identification and verification error rates by 1.43% and 37.5%, respectively
[
        <xref ref-type="bibr" rid="ref28 ref29 ref30">31, 32, 33</xref>
        ]. The aforementioned research shows that fusing different features helps
enhance the robustness and performance of various speaker recognition systems, which is
why we apply this technique in our proposed framework [
        <xref ref-type="bibr" rid="ref31 ref32">34, 35</xref>
        ].
      </p>
      <p>In the proposed technique, the different feature sets are computed separately and
the corresponding models are created with a common modeling method. Each feature-based
model is defined by its modeling variables. The introduced
combination technique produces a composite model by padding the variables of the corresponding
feature-based models. The padding step aligns the dimensions of the feature-based models
and reduces the computational complexity of the composite model. Additionally, the difficulty
that arises from mapping between different modeling parameters is avoided by using the same modeling
technique throughout. During testing, the feature vectors of the different parameters are arranged in the
same manner and presented for evaluation. The proposed fusion scheme differs in that it is based on combined
opinions: the scores produced by the introduced technique are used directly for matching
without any weights, which makes the proposed technique more suitable for real-time systems.</p>
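      <p>The padding step described above can be illustrated schematically. The paper does not specify the exact parameter layout, so the sketch below is a hypothetical example: two parameter vectors of unequal length are zero-padded to a common width and stacked into one composite parameter set.</p>
      <preformat>
```python
# Schematic model-level combination: parameter vectors of two
# feature-specific models are zero-padded to a common length and stacked
# into a composite parameter set (layout is our own assumption).
import numpy as np

def pad_to(v, length):
    """Zero-pad a 1-D parameter vector to the given length."""
    out = np.zeros(length)
    out[:v.size] = v
    return out

params_a = np.array([0.2, 0.5, 0.3])  # e.g. mixture weights of model A
params_b = np.array([0.6, 0.4])       # model B has a shorter parameter set

width = max(params_a.size, params_b.size)
composite = np.vstack([pad_to(params_a, width), pad_to(params_b, width)])
```
      </preformat>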
    </sec>
    <sec id="sec-3">
<title>4. Experiments and results</title>
      <p>The three features MFCC, RMFCC, and MPDSS are broadly used to represent vocal tract and
excitation source information. In this section we report recognition experiments to
pick the most suitable excitation source feature, particularly for the case where it is used as
additional evidence for the speaker recognition system. On the basis of performance and
robustness, the feature that gives the best performance is then chosen.</p>
      <p>
        We carry out the speaker recognition experiments using the GMM technique on the TIMIT dataset,
and the speaker verification experiments using GMM-UBM on the NIST-2003 SRE dataset. In the
speaker recognition system, signals are processed at 7500 samples per second, and
voiced/unvoiced detection is done by energy-based thresholding. The features
are computed from 25 ms overlapping speech frames at a rate of 90 frames per second, following
the recent literature. We follow the suggestions of Prasanna et al. to
derive the LP residual signal and compute the MPDSS from the 25-point LP residual power spectrum,
using the resulting 25-dimensional vectors as MPDSS features [
        <xref ref-type="bibr" rid="ref27">30</xref>
        ]. In the same way, 13
mel-cepstral coefficients combined with 13 Delta and 13 DeltaDelta coefficients are computed from the
LP residual and the speech signal to obtain the RMFCC and MFCC features, respectively.
      </p>
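      <p>The 39-dimensional feature assembly described above (13 static coefficients plus Delta and DeltaDelta) can be sketched as follows. The static coefficients here are random placeholders, and a simple frame-to-frame difference stands in for the regression-based delta computation used in practice.</p>
      <preformat>
```python
# Assembling 13 static coefficients with Delta and DeltaDelta into a
# 39-dimensional per-frame feature vector.
import numpy as np

def delta(feat):
    """Simple first-order difference along the frame axis; the output
    keeps the input shape by appending one zero row."""
    return np.vstack([feat[1:] - feat[:-1], np.zeros((1, feat.shape[1]))])

static = np.random.rand(90, 13)        # 90 frames of 13 placeholder MFCCs
d1 = delta(static)                     # Delta coefficients
d2 = delta(d1)                         # DeltaDelta coefficients
mfcc_39 = np.hstack([static, d1, d2])  # composite 39-dim feature
```
      </preformat>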
      <p>The speaker recognition performance of the MFCC, MPDSS, and RMFCC features on the
TIMIT dataset is reported in table 1. The individual accuracy of these features is assessed
using GMM modeling. The MFCC feature produces a recognition rate of
96.14%, whereas the RMFCC and MPDSS features give recognition rates of 83.74% and
74.35%, respectively. Among these features, MFCC yields the best
accuracy. The feature-level fusion results for all pairs of these features are also given in table
1; the fusion of RMFCC and MFCC produces the best recognition rate of
97.14%.</p>
      <p>The speaker verification results, given as equal error rate (EER), for the MFCC, RMFCC,
and MPDSS features are reported in table 2. The same trend is observed for speaker
verification as for speaker recognition: MFCC gives the best error rate of 6.94%,
compared with 20.24% for MPDSS and 18.10% for RMFCC. Applying feature-level fusion
to all three features, we find that the combination of RMFCC and MFCC produces the best error
rate of 5.94%, compared with 17.24% for MPDSS+RMFCC and 6.12% for MPDSS+MFCC.</p>
    </sec>
    <sec id="sec-4">
<title>5. Conclusion and future works</title>
      <p>The robustness and performance of a speech-signal-based framework depend on the quality of
its features. In the current era of research, a single feature might not be enough to
cover both robustness and performance simultaneously. To resolve this problem,
researchers use multiple sources of information by applying various fusion techniques, which
are categorized into three levels: model-level, feature-level, and score-level
combination schemes. We have used feature-level fusion in our research, with an SVM
as the classifier applied to the combined features. We have also shown
that our speaker recognition and speaker verification frameworks work well with MFCC on the
TIMIT and NIST-2003 SRE datasets, and that the fusion technique gives better results than
existing work. In future work, more features will be added to enhance the recognition
rates of the speaker recognition and verification systems, and we will try to incorporate
more deep learning methods.</p>
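      <p>The fused-feature classification step summarized above can be sketched as follows. The paper does not specify its SVM implementation or kernel, so this is an illustrative example assuming scikit-learn's SVC with an RBF kernel and random placeholder vectors in place of real fused features.</p>
      <preformat>
```python
# SVM classification over fused feature vectors (illustrative only):
# two "speakers" are represented by well-separated random clusters.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Placeholder fused vectors (e.g. MFCC+MPDSS concatenations) for 2 speakers.
X = np.vstack([rng.normal(0.0, 1.0, (50, 64)),
               rng.normal(2.0, 1.0, (50, 64))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="rbf").fit(X, y)          # train on fused features
pred = clf.predict(rng.normal(2.0, 1.0, (5, 64)))  # test near speaker 1
```
      </preformat>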
      <p>References:
[1] D. Reynolds, “Speaker identification and verification using Gaussian mixture speaker models,” Speech
Communication, vol. 17, pp. 91–108, 1995.</p>
      <p>[2] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker
verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798,
2011.</p>
      <p>[3] R. K. Das and S. Mahadeva Prasanna, “Exploring different attributes of source information for speaker
verification with limited test data,” The Journal of the Acoustical Society of America, vol. 140, no. 1, pp. 184–
190, 2016.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Pati</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Prasanna</surname>
          </string-name>
          , “
          <article-title>Speaker verification using excitation source information</article-title>
          ,”
          <source>International journal of speech technology</source>
          , vol.
          <volume>15</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>241</fpage>
          -
          <lpage>257</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Tripathi</surname>
          </string-name>
          and
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Rao</surname>
          </string-name>
          , “
          <article-title>Improvement of phone recognition accuracy using speech mode classification</article-title>
          ,”
          <source>International Journal of Speech Technology</source>
          , vol.
          <volume>21</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>489</fpage>
          -
          <lpage>500</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yegnanarayana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Prasanna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Zachariah</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          , “
          <article-title>Combining evidence from source, suprasegmental and spectral features for a fixed-text speaker verification system,” IEEE Transactions on speech and audio processing</article-title>
          , vol.
          <volume>13</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>575</fpage>
          -
          <lpage>582</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Pati</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Prasanna</surname>
          </string-name>
          , “
          <article-title>Speaker information from subband energies of linear prediction residual,” in 2010 National Conference on Communications (NCC)</article-title>
          . IEEE,
          <year>2010</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Thévenaz</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Hügli</surname>
          </string-name>
          , “
          <article-title>Usefulness of the lpc-residue in text-independent speaker verification,” Speech Communication</article-title>
          , vol.
          <volume>17</volume>
          , no.
          <issue>1-2</issue>
          , pp.
          <fpage>145</fpage>
          -
          <lpage>157</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Pati</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. R. M.</given-names>
            <surname>Prasanna</surname>
          </string-name>
          , “
          <article-title>Subsegmental, segmental and suprasegmental processing of linear prediction residual for speaker information</article-title>
          ,”
          <source>International Journal of Speech Technology</source>
          , vol.
          <volume>14</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>49</fpage>
          -
          <lpage>64</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K. S. R.</given-names>
            <surname>Murty</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Yegnanarayana</surname>
          </string-name>
          , “
          <article-title>Combining evidence from residual phase and MFCC features for speaker recognition</article-title>
          ,”
          <source>IEEE Signal Processing Letters</source>
          , vol.
          <volume>13</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>52</fpage>
          -
          <lpage>55</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Feustel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Logan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Velius</surname>
          </string-name>
          , “
          <article-title>Human and machine performance on speaker identity verification</article-title>
          ,”
          <source>The Journal of the Acoustical Society of America</source>
          , vol.
          <volume>83</volume>
          , no.
          <issue>S1</issue>
          , pp.
          <fpage>S55</fpage>
          -
          <lpage>S55</lpage>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Feustel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Logan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Velius</surname>
          </string-name>
          , “
          <article-title>Human and machine performance on speaker identity verification</article-title>
          ,”
          <source>The Journal of the Acoustical Society of America</source>
          , vol.
          <volume>83</volume>
          , no.
          <issue>S1</issue>
          , pp.
          <fpage>S55</fpage>
          -
          <lpage>S55</lpage>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Mashao</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Skosan</surname>
          </string-name>
          , “
          <article-title>Combining classifier decisions for robust speaker identification</article-title>
          ,”
          <source>Pattern Recognition</source>
          , vol.
          <volume>39</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>147</fpage>
          -
          <lpage>155</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nakagawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ohtsuka</surname>
          </string-name>
          , “
          <article-title>Speaker identification and verification by combining MFCC and phase information</article-title>
          ,”
          <source>IEEE Transactions on Audio, Speech, and Language Processing</source>
          , vol.
          <volume>20</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>1085</fpage>
          -
          <lpage>1095</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Manjunath</surname>
          </string-name>
          and
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Rao</surname>
          </string-name>
          , “
          <article-title>Source and system features for phone recognition</article-title>
          ,”
          <source>International Journal of Speech Technology</source>
          , vol.
          <volume>18</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>257</fpage>
          -
          <lpage>270</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>James</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Vaithilingam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Chiat</surname>
          </string-name>
          , “
          <article-title>Recurrent neural network-based speech recognition using MATLAB</article-title>
          ,”
          <source>International Journal of Intelligent Enterprise</source>
          , vol.
          <volume>7</volume>
          , p.
          <fpage>56</fpage>
          , doi: 10.1504/IJIE.2020.104645,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kittler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hatef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Duin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          , “
          <article-title>On combining classifiers</article-title>
          ,”
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          , vol.
          <volume>20</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>226</fpage>
          -
          <lpage>239</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Furui</surname>
          </string-name>
          , “
          <article-title>Cepstral analysis technique for automatic speaker verification</article-title>
          ,”
          <source>IEEE Transactions on Acoustics, Speech, and Signal Processing</source>
          , vol.
          <volume>29</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>254</fpage>
          -
          <lpage>272</lpage>
          , Apr.
          <year>1981</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Venturini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Coelho</surname>
          </string-name>
          , “
          <article-title>On speech features fusion, α-integration Gaussian modeling and multi-style training for noise robust speaker classification</article-title>
          ,”
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          , vol.
          <volume>22</volume>
          , no.
          <issue>12</issue>
          , pp.
          <fpage>1951</fpage>
          -
          <lpage>1964</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Campbell</surname>
          </string-name>
          , “
          <article-title>Speaker Recognition: A Tutorial</article-title>
          ,”
          <source>Proceedings of the IEEE</source>
          , vol.
          <volume>85</volume>
          , no.
          <issue>9</issue>
          , pp.
          <fpage>1437</fpage>
          -
          <lpage>1462</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D.</given-names>
            <surname>Reynolds</surname>
          </string-name>
          , “
          <article-title>Speaker identification and verification using Gaussian mixture speaker models</article-title>
          ,”
          <source>Speech Communication</source>
          , vol.
          <volume>17</volume>
          , pp.
          <fpage>91</fpage>
          -
          <lpage>108</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dehak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Kenny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dehak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dumouchel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Ouellet</surname>
          </string-name>
          , “
          <article-title>Front-end factor analysis for speaker verification</article-title>
          ,”
          <source>IEEE Transactions on Audio, Speech, and Language Processing</source>
          , vol.
          <volume>19</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>788</fpage>
          -
          <lpage>798</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Hautamäki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kinnunen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Hautamäki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Leino</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.-M.</given-names>
            <surname>Laukkanen</surname>
          </string-name>
          , “
          <article-title>I-vectors meet imitators: on vulnerability of speaker verification systems against voice mimicry</article-title>
          ,”
          in
          <source>Proc. Interspeech</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>930</fpage>
          -
          <lpage>934</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Hautamäki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kinnunen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Hautamäki</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.-M.</given-names>
            <surname>Laukkanen</surname>
          </string-name>
          , “
          <article-title>Automatic versus human speaker verification: The case of voice mimicry</article-title>
          ,”
          <source>Speech Communication</source>
          , vol.
          <volume>72</volume>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>31</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lindberg</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Blomberg</surname>
          </string-name>
          , “
          <article-title>Vulnerability in Speaker Verification - A Study of Technical Impostor Techniques</article-title>
          ,”
          <source>Proceedings of European Conference on Speech Communication and Technology</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Villalba</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Lleida</surname>
          </string-name>
          , “
          <article-title>Speaker verification performance degradation against spoofing and tampering attacks</article-title>
          ,”
          in
          <source>FALA 2010 Workshop</source>
          , pp.
          <fpage>131</fpage>
          -
          <lpage>134</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hosseinzadeh</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          , “
          <article-title>Combining vocal source and MFCC features for enhanced speaker recognition performance using GMMs</article-title>
          ,”
          in
          <source>IEEE 9th Workshop on Multimedia Signal Processing (MMSP 2007)</source>
          . IEEE,
          <year>2007</year>
          , pp.
          <fpage>365</fpage>
          -
          <lpage>368</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>S.</given-names>
            <surname>Furui</surname>
          </string-name>
          , “
          <article-title>Cepstral analysis technique for automatic speaker verification</article-title>
          ,”
          <source>IEEE Transactions on Acoustics, Speech, and Signal Processing</source>
          , vol.
          <volume>29</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>254</fpage>
          -
          <lpage>272</lpage>
          , Apr.
          <year>1981</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Krishna</given-names>
            <surname>Dutta</surname>
          </string-name>
          , “
          <article-title>Hybrid fusion scheme for different speech processing tasks</article-title>
          ,”
          <source>DOE, NIT Nagaland</source>
          , May
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Prasanna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Yegnanarayana</surname>
          </string-name>
          , “
          <article-title>Extraction of speakerspecific excitation information from linear prediction residual of speech,” Speech Communication</article-title>
          , vol.
          <volume>48</volume>
          , no.
          <issue>10</issue>
          , pp.
          <fpage>1243</fpage>
          -
          <lpage>1261</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rahul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kohli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          , “
          <article-title>Facial expression recognition using geometric features and modified hidden Markov model</article-title>
          ,”
          <source>International Journal of Grid and Utility Computing</source>
          , vol.
          <volume>10</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>488</fpage>
          -
          <lpage>496</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jyoti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Yadav</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Rahul</surname>
          </string-name>
          , “
          <article-title>Blockchain Security Attacks, Difficulty, and Prevention</article-title>
          ,”
          <source>Recent Advances in Electrical &amp; Electronic Engineering (Formerly Recent Patents on Electrical &amp; Electronic Engineering)</source>
          , vol.
          <volume>16</volume>
          , doi: 10.2174/0123520965252489231002071659,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rahul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tiwari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Prakash</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Yadav</surname>
          </string-name>
          , “
          <article-title>Garment Defect Detection System Based on Histogram Using Deep Learning</article-title>
          ,” doi: 10.1007/978-981-99-3716-5_22,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rahul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shukla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tyagi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tiwari</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Yadav</surname>
          </string-name>
          , “
          <article-title>A New Hybrid Approach for Efficient Emotion Recognition using Deep Learning</article-title>
          ,”
          <source>International Journal of Electrical and Electronics Research</source>
          , vol.
          <volume>10</volume>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>22</lpage>
          , doi: 10.37391/IJEER.100103,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rahul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Yadav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dellwar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          , “
          <article-title>Impact of Similarity Measures in K-means Clustering Method used in Movie Recommender Systems</article-title>
          ,”
          <source>IOP Conference Series: Materials Science and Engineering</source>
          , vol.
          <volume>1022</volume>
          , doi: 10.1088/1757-899X/1022/1/012101,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>