<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Speech Analytics Architecture for Banking Contact Centers</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Plekhanov Russian University of Economics</institution>
          ,
          <addr-line>36 Stremyanny lane, Moscow, 115998</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>According to the Central Bank of the Russian Federation, more than 400 organizations provide banking services in Russia [1]. One of the key features of a mature and saturated market is competition between participants on either price or service quality [1]. Despite the spread of text-based communication channels, the call center remains one of the key channels for providing client services and promoting products [2, 3]. The development of digital technologies based on machine learning algorithms, such as speech recognition, sentiment analysis and semantic analysis, opens new opportunities for financial organizations to achieve an outstanding level of call center service by scrutinizing internal processes and inferring the best practices of top-performing employees. This paper addresses the application of machine learning technologies in contact centers, with a focus on banking organizations. The study concentrates on two business cases: correspondence of the operator's speech to a call script, and the increase of product sales. The article presents a conceptual framework and an architecture for a prospective software application based on convolutional neural networks (CNN) and recurrent neural networks (RNN), designed to automate the analysis of banking call center operations.</p>
      </abstract>
      <kwd-group>
        <kwd>banking</kwd>
        <kwd>contact center</kwd>
        <kwd>machine learning</kwd>
        <kwd>speech recognition</kwd>
        <kwd>text mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        According to the Central Bank of the Russian Federation, the number of organizations
providing banking services in Russia exceeds 400 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Despite a significant
decrease in the number of banks in the country over the last 10 years, the market is still
highly competitive and comparable to countries such as Austria, France and Italy
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. One of the key features of a mature and saturated market is competition between
participants on either price or service quality [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Despite the spread of text-based
communication channels, the call center remains one of the key channels for providing
client services and promoting products [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. The development of digital
technologies based on machine learning algorithms, such as speech recognition, sentiment
analysis and semantic analysis, reveals new opportunities for companies transforming
their call centers. Besides the obvious application of substituting the manual work of
supervisors who control the quality of operators' performance, machine learning
technologies can be employed as an effective and efficient instrument to improve the
quality of the services offered by call centers by scrutinizing internal processes and
inferring the best practices of top-performing
employees.
      </p>
      <p>This paper addresses the application of machine learning in contact center operations,
with a focus on banking organizations. The study concentrates on two business cases:</p>
      <p>Business Case 1 (BC1): Correspondence of the operator's speech to a call script.</p>
      <p>Business Case 2 (BC2): Increase of product sales via contact center.</p>
      <p>The structure of the rest of this paper is as follows. Section 2 presents the results of
the literature review and the key theoretical foundations. Section 3 explains the
approach to transforming the business problems into formal machine learning problems.
Section 4 presents a theoretical framework for a prospective software application.
Section 5 designs the conceptual architecture of the application. Section 6 summarizes
the findings and proposals and outlines further directions of the research.</p>
    </sec>
    <sec id="sec-2">
      <title>Machine learning foundations</title>
      <p>
        The fundamentals of machine learning are based on the seminal paper by Rosenblatt
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], in which he proposed a device named the perceptron, a model of a neuron that can be
taught to recognize images. Rosenblatt's idea was implemented in 1960 on an IBM
704 computer. Further development of neural network theory brought the concept of
connectionism [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], the concept of distributed representations [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and the backpropagation
algorithm [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Modern methods of speech recognition are based mostly on one of the following
algorithms:</p>
      <p>
        • Hidden Markov Models (HMM) combined with Gaussian Mixture Models (GMM) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ];
• Bayesian discrimination [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ];
• Dynamic Time Warping (DTW) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ];
• Recurrent Neural Networks (RNN) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ];
• Restricted Boltzmann Machines (RBM) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>
        HMM are based on the concept of a Markov chain, representing the interconnection of
a set of variables or states and the probabilities of transition between them [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The
model also includes hidden states that are not observed directly. For instance, in
the task of recognition, parts of speech might be determined only from the context
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>The key assumptions underlying the model are:</p>
      <p>• Prediction of the future state is independent of past observations;
• The probability of the predicted observation depends only on the state that
predetermined it.</p>
      <p>Thus, only the current state is analyzed in order to predict the next transition.</p>
      <p>
        Based on the HMM, three formal problems were defined: likelihood, decoding and
learning. Tackling these three issues allows one to construct a predictive model for
real-time speech recognition [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
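      <p>The decoding problem can be illustrated with a toy Viterbi pass over a two-state model; the states, probabilities and observations below are illustrative stand-ins, not values from any real acoustic model:</p>

```python
# Toy illustration of the HMM "decoding" problem: the Viterbi algorithm
# finds the most likely hidden-state sequence for a sequence of
# observations. All states and probabilities here are made up.
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = (best path probability ending in state s at time t, path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, path = max(
                (V[t - 1][prev][0] * trans_p[prev][s] * emit_p[s][obs[t]],
                 V[t - 1][prev][1])
                for prev in states)
            V[t][s] = (prob, path + [s])
    return max(V[-1].values())

states = ("vowel", "consonant")
start = {"vowel": 0.5, "consonant": 0.5}
trans = {"vowel": {"vowel": 0.3, "consonant": 0.7},
         "consonant": {"vowel": 0.6, "consonant": 0.4}}
emit = {"vowel": {"low": 0.7, "high": 0.3},
        "consonant": {"low": 0.2, "high": 0.8}}

# path is the most probable hidden-state sequence for the observations
prob, path = viterbi(["low", "high", "low"], states, start, trans, emit)
```

      <p>The likelihood problem is solved analogously with the forward algorithm, and learning with Baum-Welch re-estimation.</p>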
      <p>
        GMM refers to the class of probabilistic models assuming that all observations
within a dataset are produced by Gaussian distributions with unknown parameters.
The formal problem is stated as the estimation of the distribution parameters, given a
set of observations, by applying the expectation-maximization algorithm [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
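      <p>The expectation-maximization procedure can be sketched for a one-dimensional, two-component mixture; the data and starting parameters are illustrative:</p>

```python
import math

# EM for a one-dimensional, two-component Gaussian mixture: alternate the
# E-step (posterior responsibility of each component for each point) and
# the M-step (re-estimate weights, means and variances).
def em_gmm(data, mu, iters=50):
    w = [0.5, 0.5]          # mixture weights
    var = [1.0, 1.0]        # component variances
    for _ in range(iters):
        # E-step: responsibilities
        resp = []
        for x in data:
            dens = [w[k] / math.sqrt(2 * math.pi * var[k])
                    * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                    for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate parameters from the responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, data)) / nk, 1e-6)
    return w, mu, var

# Two well-separated clusters around 0 and 10; EM recovers their means.
data = [-0.5, 0.0, 0.4, 9.6, 10.0, 10.3]
w, mu, var = em_gmm(data, mu=[1.0, 8.0])
```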
      <p>
        Researchers have proposed various combinations of these models and shown them to
be more effective than single-model algorithms. For example, GMM might be
implemented to derive the observation probabilities of certain states in HMM, or a decision
tree might be designed to maximize the likelihood value [
        <xref ref-type="bibr" rid="ref20 ref21 ref22 ref23">20-23</xref>
        ].
      </p>
      <p>
        Unlike HMM, the Bayesian approach presumes the introduction of all model variables
during model design, and the posterior distribution of the variables is derived using Bayes'
rule [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Thus, the formal problem of the approach is distribution estimation.
      </p>
      <p>
        Compared to HMM and GMM, the Bayesian approach provides the
following advantages:
• The predicted observation depends on a set of prior states;
• Marginalization of model parameters yields improved classification;
• Model selection is performed by maximizing the probability of the posterior
distribution of model components [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
      <p>The approach requires considerable computation, which restricts its areas of
implementation.</p>
      <p>
        DTW refers to a class of algorithms applied to time series. The prerequisite for its
application is the possibility to shrink or stretch one of the time series along the time axis.
The "warping" procedure proportionally aligns the two time series with each other
and serves as a normalization phase. At the next stage the distance between the time
series patterns is calculated, providing a similarity measure [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Although the
technique proved itself well as a method to authenticate a speaker by voice, its application
in the speech recognition domain is limited to voice authentication [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
      </p>
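      <p>A minimal sketch of the DTW distance computation; the integer series below stand in for real acoustic feature sequences:</p>

```python
# Dynamic Time Warping distance: fill a cumulative-cost matrix in which
# each cell holds the cheapest alignment of the two prefixes, allowing
# either series to stretch or shrink along the time axis.
def dtw_distance(a, b):
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # stretch b
                                 D[i][j - 1],      # stretch a
                                 D[i - 1][j - 1])  # match step
    return D[n][m]

# The second series is a time-stretched copy of the first, so the warped
# distance is zero even though the lengths differ.
d = dtw_distance([1, 2, 3, 2], [1, 1, 2, 2, 3, 3, 2])
```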
      <p>
        Neural networks are represented in the list of algorithms by RNN and RBM. An RNN
is a neural network with input, hidden and output layers in which hidden-layer outputs
are fed back as inputs. Apart from learning during the training procedure, such a design
facilitates learning during network utilization [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ].
      </p>
      <p>
        RBM are constructed of nodes of two types, visible and hidden. Nodes between the
layers are interconnected by a fully bipartite graph. The resulting model is stochastic and
generative [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ].
      </p>
      <p>
        Over the last decade, the implementation of RBM for modeling input data yielded a
significant improvement in recognition rate and motivated academic researchers and
industry experts to study the application of deep learning to speech analysis. RNN
models extended the research field and outperformed RBM networks, reaching a
recognition error rate of 17.7 percent [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        The next academic field relevant to this research is Natural Language Processing
(NLP). NLP is a cross-disciplinary field involving linguistics, machine learning and
psychology [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. The first practical demonstration of NLP refers to the
Georgetown-IBM experiment, in which a predecessor of contemporary machine
translation systems such as Google Translate was demonstrated. Within the experiment,
60 sentences were translated from Russian
to English [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ].
      </p>
      <p>Inspired by these implementations and successes, researchers have intensively
promoted new approaches to NLP tasks, pushing the whole field forward. The
state-of-the-art techniques applied to NLP problems include maximum entropy learning
(MEL), memory-based learning (MBL), decision trees (DT) and convolutional neural
networks (CNN).</p>
      <p>MEL is a log-linear conditional probability model. The formal problem underlying
the technique is the selection of the model with maximum entropy from the set of models
satisfying the training dataset. The selection process is organized iteratively and
therefore requires substantial computation [36].</p>
      <p>MBL represents a straightforward machine learning algorithm based on a simple
approximation approach. Every piece of data is stored in a database, and predictions are
made based on the similarity of the input dataset to the stored ones, characterized by
a distance metric [37].</p>
      <p>Another technique for NLP classification problems is DT. It can be depicted as a
hierarchical structure with every node representing a decision and every leaf
representing a predetermined output class. DT is one of the most efficient machine learning
algorithms for the NLP problem of part-of-speech tagging [38].</p>
      <p>CNN is considered a powerful deep learning algorithm designed as a multilayer
network in which each neuron is connected to a local region of the next layer and
weights are shared across positions. This algorithm proved itself in the fields of image
recognition and text mining [39]. However, its effectiveness comes at the price of
significant computational cost [40].</p>
      <p>The combination of the mentioned research areas provides the theoretical basis for
developing a model applicable to the defined business cases of improving banking
contact center efficiency.</p>
    </sec>
    <sec id="sec-3">
      <title>Conceptual framework setup</title>
      <p>In order to transform the baseline business cases into formal research problems, I
define the key operations in the business process flow. The first case refers to the
problem of quality control and is defined in this paper as the correspondence of the
operator's speech to a call script. The operator's interaction with a client is recorded and
stored on a dedicated server as a phonogram. The phonogram is then transcribed using a
speech-to-text algorithm. The next step includes analysis using text mining algorithms
in order to calculate metrics of the correspondence of the communication flow to the
predefined script. The described case is depicted as a logical diagram in Figure 1.</p>
      <p>Speech recognition relates to the classification problem of the machine learning
domain. It can be defined as the task of designing an algorithm that allocates an
arbitrary object to one of the predefined groups with a certain probability [41].
Assume I = (i1, i2, …, in) denotes the set of fragments received after recording an audio
stream and O = (o1, o2, …, om) is the set of phonemes or words we expect to obtain. It is
necessary to build a function f which computes the most probable sequence of
phonemes or words O corresponding to the given set of audio fragments I:
O = f(I) = argmax P(O | I) (1)</p>
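      <p>At the level of a single audio fragment, such a classifier reduces to an argmax over candidate labels; the scalar "fragments" and the prototype table below are illustrative stand-ins for real acoustic feature vectors and a trained acoustic model:</p>

```python
# Sketch of the function f in (1): every audio fragment is scored against
# each candidate phoneme and the most probable label is chosen. The
# prototype values are hypothetical, purely for illustration.
prototypes = {"a": 0.9, "s": 0.2, "t": 0.5}   # hypothetical phoneme prototypes

def score(fragment, phoneme):
    # Higher score = fragment closer to the phoneme's prototype
    return -abs(fragment - prototypes[phoneme])

def classify(fragments, phonemes):
    # f(I): for each fragment pick the phoneme with the highest score
    return [max(phonemes, key=lambda p: score(f, p)) for f in fragments]

labels = classify([0.85, 0.15, 0.55], list(prototypes))
```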
      <p>At the next step, text data analysis is performed to extract quality metrics of the
operator's interaction. For instance, one of the text clustering methods may be applied
to calculate the correspondence of the recognized text to a call script. Additionally,
classification of certain text fragments may be used to discover the presence of welcome
and farewell expressions in the operator's speech [42]. The number of such metrics
depends on the specific business model of quality evaluation.</p>
      <p>The second business case extends the first one. The accumulated text data and the
metrics describing the correspondence of an interaction to a call script are enriched with
sales data. At the next stage, the resulting dataset is processed using regression
analysis to infer which characteristics of the operator's speech or parts of call scripts
influence sales rates. Figure 3 depicts the defined process.</p>
      <p>With the text metrics and the additional sales data it is possible to perform classical
regression analysis to explore the patterns that most influence sales via the contact center.
As a baseline, a simple linear regression model can be designed with successful sales
as the dependent variable and a set of speech metrics as independent variables. A
nonlinear regression may be constructed in case the resulting model does not fit the
dataset well [43].</p>
    </sec>
    <sec id="sec-4">
      <title>Proposed approaches</title>
      <sec id="sec-4-1">
        <title>Speech recognition</title>
        <p>
          Following the conceptual framework constructed in the previous section, the first
operation to be performed is the classification of audio fragments. A state-of-the-art
approach yielding excellent results for this task is RNN modeling. Graves et al.
achieved a 17.7 percent phoneme error rate on the
TIMIT speech dataset [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. The researchers designed a deep bidirectional RNN with
Long Short-Term Memory (LSTM) cells used for the hidden layers.
        </p>
        <p>The training process for the LSTM RNN includes several methods: connectionist
temporal classification (CTC), the RNN transducer, decoding and regularization. The
approach provides simultaneous training of the network to classify the input acoustic
information and to seek the most appropriate following phonemes, thus
constituting a joint acoustic and language model.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Text mining</title>
        <p>The text mining process consists of standard operations over the initial dataset,
described below.</p>
        <p>Preprocessing includes the basic procedures of tokenization, stemming and stop-word
removal. Initially, the text stream produced by the speech recognition algorithm is
broken up into a set of separate elements, for instance words, abbreviations and
interjections. Then the tokenized set of textual data is reduced to word stems, the base or
root forms of the inflected or derived forms. This process represents the stemming
procedure. Lastly, the dataset is cleared of stop words using a stop-word list.</p>
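        <p>The three preprocessing steps can be sketched as follows; the stop-word list and the suffix-stripping "stemmer" are simplified stand-ins for production components:</p>

```python
import re

# Minimal preprocessing pipeline: tokenization, stop-word removal and a
# naive suffix-stripping "stemmer". The stop-word list and suffix table
# are illustrative; production systems use full stemmers such as Porter's.
STOP_WORDS = {"the", "a", "an", "is", "to", "of"}
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z']+", text.lower())         # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop words
    return [stem(t) for t in tokens]                      # stemming

tokens = preprocess("Thank you for calling the bank")
```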
        <p>For the purpose of calculating the quality of the operator's interaction, two metrics
are proposed: correspondence of the speech to a call script, and the presence of greeting,
farewell and gratitude expressions. Additionally, for the purpose of further analysis,
automatic detection of the interaction topic is needed; therefore keyword detection is
also included in the text mining process.</p>
        <p>In order to evaluate the correspondence of speech to a call script, a text matching
model is proposed. Recent experiments by Pang et al. showed that their model, called
MatchPyramid, reaches an accuracy of up to 75.94 percent in comparing texts on the
MSRP dataset [44]. The research group applied image recognition algorithms, designing
a CNN over a matching matrix constituted of indicators representing the similarity
between words of the compared texts. Optimal results for the model were achieved when
processing fragments of 30-50 words. Considering that the interaction between a banking
contact center operator and a client is a dialog, such fragments fit well the length of
each side's speech.</p>
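        <p>The matching matrix at the core of this approach can be sketched with exact-match indicators; the published MatchPyramid model uses embedding-based word similarities instead, and the sentences below are illustrative:</p>

```python
# Sketch of the matching matrix fed to MatchPyramid's CNN: one indicator
# entry per word pair of the two compared texts (1.0 on an exact match,
# 0.0 otherwise).
def matching_matrix(text_a, text_b):
    a, b = text_a.split(), text_b.split()
    return [[1.0 if wa == wb else 0.0 for wb in b] for wa in a]

# Script fragment vs. operator utterance (illustrative)
M = matching_matrix("please confirm your card number",
                    "confirm card number please")
```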
        <p>The second metric can be estimated with the MatchPyramid model as well. It is
necessary to define an expressions dictionary as a baseline for comparison.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Data analysis</title>
        <p>The final operation is building a regression model on an enriched dataset consisting of
information about the operator and the call, such as job-shift time, client's gender,
client's age and others, defined as c1, c2, …, ci, the speech metrics, denoted m1 and m2,
and the sales result, denoted sj. Putting sj as the response variable and ci together with mj
as explanatory variables yields the following relation:</p>
        <p>sj = f(c1, c2, …, ci, m1, m2) (2)</p>
        <p>Assuming a simple linear relation between the variables, the equation can be
transformed into the following model:
sj = β0 + β1c1 + … + βici + βi+1m1 + βi+2m2 + ε (3)</p>
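        <p>A least-squares fit of such a linear model can be sketched with the normal equations; the dataset, with one call attribute and one speech metric, is synthetic:</p>

```python
# Ordinary least squares for a model like (3): sales result s explained
# by a call attribute c1 and a speech metric m1 plus an intercept.
# Coefficients solve the normal equations (X'X) beta = X'y via
# Gauss-Jordan elimination; the data are synthetic.
def ols(X, y):
    rows = [[1.0] + list(x) for x in X]        # prepend intercept column
    k = len(rows[0])
    # Augmented normal-equation matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    for i in range(k):                          # Gauss-Jordan elimination
        pivot = A[i][i]
        A[i] = [v / pivot for v in A[i]]
        for j in range(k):
            if j != i:
                factor = A[j][i]
                A[j] = [vj - factor * vi for vj, vi in zip(A[j], A[i])]
    return [A[i][k] for i in range(k)]          # [beta0, beta1, beta2]

# s = 1 + 2*c1 + 3*m1 exactly, so OLS recovers these coefficients
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * c + 3 * m for c, m in X]
beta = ols(X, y)
```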
        <p>The rule of thumb for linear regression sample size suggests that the number of
observations should be 20 times the number of independent variables. Therefore,
100-200 calls per day are needed to perform daily analysis; alternatively, the analysis
may be performed on a weekly or monthly basis.</p>
        <p>In case the linear regression model provides poor results, it can be replaced with a
non-linear regression model.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <sec id="sec-5-1">
        <title>Conceptual architecture design</title>
        <p>The previous section shows that two of the three key modules of the prospective
speech analytics application can be designed employing RNN and CNN. The
application can be built on a 3-layer architecture: a data access web server ensuring the
security of data transfer, an analysis server with the deployed speech recognition and text
mining models, and a database server storing recognized audio records, algorithm
settings and application logs. All interactions with corporate information systems, such as
the corporate data warehouse and the audio records storage, are performed via the data
access web server. An overview of the landscape is presented in Figure 4.</p>
        <p>At the application level, the architecture of the solution is constructed of a set of
components for speech recognition and text mining operations, controllers for training
and modelling, a regression analysis component and a representation layer. Although
two different networks are used for the tasks of speech recognition and text analysis,
the processes are sequential; therefore, it is possible to utilize the same hardware.</p>
        <p>The CRISP-DM standard suggests an evaluation stage to control modelling results [45].
To incorporate this practice, the application includes components for model update and
retraining.</p>
        <p>The architecture is displayed in Figure 5.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Discussion</title>
        <p>More than two thirds of requests for banking services are still made via telephone.
Moreover, telemarketing is still one of the key distribution channels for financial products.
In both cases the quality of communication depends on the performance of the operators
in the contact center.</p>
        <p>While classical instruments for quality control, such as supervision and selective
checks, allow only a limited number of interactions to be reviewed, automated control
covers all communications with clients, bringing a holistic view of call center
operations. Moreover, selective checks become redundant, and the number of supervisors
and managers may be reduced, providing a drop in staff costs. Apart from these
outcomes, the data derived from the phonograms, integrated within the corporate data
warehouse with information from other data sources, such as CRM systems, social
networks, and external and internal scoring systems, can produce additional value by
expanding the customer profile. Enriching customer data improves the predictive power
of recommendation models such as next best offer and yields improved conversion rates
and sales figures.</p>
        <p>Moreover, the designed architecture can easily be integrated with other
communication channels, for instance chat platforms. Chats deployed on websites or
implemented in mobile applications provide ready-for-analysis data that can be processed
using text mining algorithms.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and further directions</title>
      <p>This paper introduces a conceptual model and an architecture for a software application
based on CNN and RNN networks, designed to automate the analysis of call center
operations. BC1 and BC2, defined in the introduction, are of value to the majority of
banking organizations.</p>
      <p>The next step in the research is developing a prototype of the application and
performing tests to evaluate the recognition rate and the quality of text analysis in
one of the top 10 banks in the Russian Federation.</p>
      <p>Further research directions may also include the definition of new business cases that
can be solved using machine learning technologies, or the enhancement of the discussed
cases by applying new algorithms.</p>
      <p>36. Neumann, S., N. Ahituv, and M. Zviran: A measure for determining the strategic relevance
of IS to the organization. Information &amp; Management, 1992. 22(5): pp. 281-299.
37. Lin, J.-H. and J.S. Vitter: A Theory for Memory-Based Learning. Machine Learning,
1994. 17(2): pp. 143-167. DOI: 10.1023/A:1022667616941.
38. Quinlan, J.R.: Induction of decision trees. Machine Learning, 1986. 1(1): pp. 81-106. DOI:
10.1007/BF00116251.
39. Gu, J., et al.: Recent advances in convolutional neural networks. Pattern Recognition,
2018. 77: pp. 354-377. DOI: 10.1016/j.patcog.2017.10.013.
40. He, K. and J. Sun: Convolutional neural networks at constrained time cost. In 2015 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR). 2015. DOI:
10.1109/CVPR.2015.7299173.
41. Goodfellow, I., Y. Bengio, and A. Courville: Deep Learning. 2016: The MIT Press.
42. Hotho, A.: A Brief Survey of Text Mining. GLDV-Journal for Computational Linguistics
and Language Technology, 2005. 20(1): pp. 19-62.
43. Draper, N.R.: Applied Regression Analysis. Wiley Series in Probability and Statistics,
1998. DOI: 10.1002/9781118625590.
44. Pang, L., et al.: Text matching as image recognition. In Proceedings of the Thirtieth AAAI
Conference on Artificial Intelligence. 2016, AAAI Press: Phoenix, Arizona. pp. 2793-2799.
45. Chapman, P., et al.: CRISP-DM 1.0: Step-by-Step Data Mining Guide. The CRISP-DM
Consortium, 2000.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , et al.:
          <article-title>The Evolution of Strategic Simplicity: Exploring Two Models of Organizational Adaption</article-title>
          .
          <source>Journal of Management</source>
          ,
          <year>1996</year>
          .
          <volume>22</volume>
          (
          <issue>6</issue>
          ): pp.
          <fpage>863</fpage>
          -
          <lpage>887</lpage>
          DOI: 10.1177/014920639602200604.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. J.D. Power:
          <article-title>Ten Years After Great Recession, Innovation Overcomes Reputation as Bank Switching Hits Record Low</article-title>
          .
          <year>2019</year>
          , J.D. Power
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Deloitte: Global Contact Center Survey.
          <year>2019</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>4. Information on the Banking System of the Russian Federation</article-title>
          . https://www.cbr.ru/eng/statistics/pdko/lic/,
          <source>last accessed</source>
          <year>2020</year>
          /02/01.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Facts and Figures Banking in Europe 2019. https://www.ebf.eu/facts-andfigures/statistical-annex/,
          <source>last accessed</source>
          <year>2020</year>
          /02/01.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Rosenblatt</surname>
            ,
            <given-names>F.F.</given-names>
          </string-name>
          :
          <article-title>The perceptron: a probabilistic model for information storage and organization in the brain</article-title>
          .
          <source>Psychological review</source>
          ,
          <year>1958</year>
          . 65(6): pp.
          <fpage>386</fpage>
          -
          <lpage>408</lpage>
          DOI: 10.1037/h0042519.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Rumelhart</surname>
            ,
            <given-names>D.E.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.L.</given-names>
            <surname>McClelland</surname>
          </string-name>
          , and the PDP Research Group: Parallel Distributed Processing:
          <article-title>Explorations in the Microstructure of Cognition</article-title>
          .
          <source>1986</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>McClelland</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rumelhart</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          :
          <article-title>The appeal of parallel distributed processing</article-title>
          .
          <source>Computation &amp; intelligence</source>
          , 1995: pp.
          <fpage>305</fpage>
          -
          <lpage>341</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McClelland</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Rumelhart</surname>
          </string-name>
          :
          <article-title>Distributed representations</article-title>
          .
          <source>Parallel Distributed Processing: Explorations in the Microstructure of Cognition</source>
          . Vol.
          <volume>1</volume>
          .
          <year>1986</year>
          . pp.
          <fpage>77</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Rumelhart</surname>
            ,
            <given-names>D.E.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>G.E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.J.</given-names>
            <surname>Williams</surname>
          </string-name>
          :
          <article-title>Learning representations by backpropagating errors</article-title>
          .
          <source>Nature</source>
          ,
          <year>1986</year>
          .
          <volume>323</volume>
          (
          <issue>6088</issue>
          ): pp.
          <fpage>533</fpage>
          -
          <lpage>536</lpage>
          DOI: 10.1038/323533a0.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Bahl</surname>
            ,
            <given-names>L.R.</given-names>
          </string-name>
          , et al.:
          <article-title>Speech recognition with continuous-parameter hidden Markov models</article-title>
          .
          <source>ICASSP-88, International Conference on Acoustics, Speech, and Signal Processing</source>
          .
          <year>1988</year>
          . DOI: 10.1109/ICASSP.1988.196504.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Norris</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>J.M.</given-names>
            <surname>McQueen</surname>
          </string-name>
          :
          <article-title>Shortlist B: A Bayesian model of continuous speech recognition</article-title>
          .
          <source>Psychological Review</source>
          ,
          <year>2008</year>
          .
          <volume>115</volume>
          (
          <issue>2</issue>
          ): pp.
          <fpage>357</fpage>
          -
          <lpage>395</lpage>
          . DOI: 10.1037/0033-295X.115.2.357.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Muda</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Begam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Elamvazuthi</surname>
          </string-name>
          :
          <article-title>Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques</article-title>
          .
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.-r.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          :
          <article-title>Speech Recognition with Deep Recurrent Neural Networks</article-title>
          .
          <source>In IEEE International Conference on Acoustics, Speech and Signal Processing</source>
          .
          <year>2013</year>
          . DOI: 10.1109/ICASSP.2013.6638947.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Dahl</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          :
          <article-title>Phone recognition with the mean-covariance restricted Boltzmann machine</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          ,
          <year>2010</year>
          .
          <volume>23</volume>
          : pp.
          <fpage>469</fpage>
          -
          <lpage>477</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Aggoun</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.B.</given-names>
            <surname>Moore</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.J.</given-names>
            <surname>Elliott</surname>
          </string-name>
          :
          <article-title>Hidden Markov models: estimation and control</article-title>
          .
          <source>Stochastic Modelling and Applied Probability</source>
          .
          <year>1995</year>
          , Dordrecht: Springer.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , et al.:
          <article-title>Large vocabulary word recognition using context-dependent allophonic hidden Markov models</article-title>
          .
          <source>Computer Speech &amp; Language</source>
          ,
          <year>1990</year>
          .
          <volume>4</volume>
          (
          <issue>4</issue>
          ): pp.
          <fpage>345</fpage>
          -
          <lpage>357</lpage>
          . DOI: 10.1016/0885-2308(90)90015-X.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>J.H.</given-names>
            <surname>Martin</surname>
          </string-name>
          :
          <article-title>Speech and language processing</article-title>
          . Vol.
          <volume>3</volume>
          .
          <year>2014</year>
          : Pearson, London.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Reynolds</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Gaussian Mixture Models</article-title>
          , in
          <source>Encyclopedia of Biometrics</source>
          ,
          <string-name>
            <given-names>S.Z.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.K.</given-names>
            <surname>Jain</surname>
          </string-name>
          , Editors.
          <year>2015</year>
          ,
          Springer US
          : Boston, MA. pp.
          <fpage>827</fpage>
          -
          <lpage>832</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Akamine</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ajmera</surname>
          </string-name>
          :
          <article-title>Decision tree-based acoustic models for speech recognition</article-title>
          .
          <source>EURASIP Journal on Audio, Speech, and Music Processing</source>
          ,
          <year>2012</year>
          .
          <volume>2012</volume>
          (
          <issue>1</issue>
          ):
          <fpage>10</fpage>
          . DOI: 10.1186/1687-4722-2012-10.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Ju</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , et al.:
          <article-title>Dynamic Grasp Recognition Using Time Clustering, Gaussian Mixture Models and Hidden Markov Models</article-title>
          .
          <source>Advanced Robotics</source>
          ,
          <year>2009</year>
          .
          <volume>23</volume>
          (
          <issue>10</issue>
          ): pp.
          <fpage>1359</fpage>
          -
          <lpage>1371</lpage>
          . DOI: 10.1163/156855309X462628.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Pujol</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , et al.:
          <article-title>Comparison and combination of features in a hybrid HMM/MLP and a HMM/GMM speech recognition system</article-title>
          .
          <source>IEEE Transactions on Speech and Audio Processing</source>
          ,
          <year>2005</year>
          .
          <volume>13</volume>
          (
          <issue>1</issue>
          ): pp.
          <fpage>14</fpage>
          -
          <lpage>22</lpage>
          . DOI: 10.1109/TSA.2004.834466.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Swietojanski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghoshal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Renals</surname>
          </string-name>
          :
          <article-title>Revisiting hybrid and GMM-HMM system combination techniques</article-title>
          .
          <source>In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing</source>
          .
          <year>2013</year>
          . DOI: 10.1109/ICASSP.2013.6638967.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Bayes</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Price</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>An Essay towards Solving a Problem in the Doctrine of Chances. By the Late Rev. Mr. Bayes, F. R. S. Communicated by Mr. Price, in a Letter to John Canton, A. M. F. R. S.</article-title>
          <source>Philosophical Transactions (1683-1775)</source>
          ,
          <year>1763</year>
          .
          <volume>53</volume>
          : pp.
          <fpage>370</fpage>
          -
          <lpage>418</lpage>
          . DOI: 10.1098/rstl.1763.0053.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Griffiths</surname>
            ,
            <given-names>T.L.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kemp</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.B.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          :
          <article-title>Bayesian models of cognition</article-title>
          , in
          <source>The Cambridge Handbook of Computational Psychology</source>
          .
          <year>2008</year>
          , Cambridge University Press: New York, NY, US. pp.
          <fpage>59</fpage>
          -
          <lpage>100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Vasimalla</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Narasimham</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Naik</surname>
          </string-name>
          :
          <article-title>Efficient Dynamic Time Warping for Time Series Classification</article-title>
          .
          <source>Indian Journal of Science and Technology</source>
          ,
          <year>2016</year>
          .
          <volume>9</volume>
          . DOI: 10.17485/ijst/2016/v9i21/93886.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Permanasari</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Harahap</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Prayoga</surname>
          </string-name>
          :
          <article-title>Speech recognition using Dynamic Time Warping (DTW)</article-title>
          .
          <source>Journal of Physics: Conference Series</source>
          ,
          <year>2019</year>
          .
          <volume>1366</volume>
          :
          <fpage>012091</fpage>
          . DOI: 10.1088/1742-6596/1366/1/012091.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Rodriguez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wiles</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.L.</given-names>
            <surname>Elman</surname>
          </string-name>
          :
          <article-title>A Recurrent Neural Network that Learns to Count</article-title>
          .
          <source>Connection Science</source>
          ,
          <year>1999</year>
          .
          <volume>11</volume>
          (
          <issue>1</issue>
          ): pp.
          <fpage>5</fpage>
          -
          <lpage>40</lpage>
          . DOI: 10.1080/095400999116340.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Igel</surname>
          </string-name>
          :
          <article-title>Training restricted Boltzmann machines: An introduction</article-title>
          .
          <source>Pattern Recognition</source>
          ,
          <year>2014</year>
          .
          <volume>47</volume>
          (
          <issue>1</issue>
          ): pp.
          <fpage>25</fpage>
          -
          <lpage>39</lpage>
          . DOI: 10.1016/j.patcog.2013.05.025.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Klein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Loper</surname>
          </string-name>
          :
          <source>Natural Language Processing with Python</source>
          .
          <year>2009</year>
          : O'Reilly Media, Inc.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Hutchins</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>The Georgetown-IBM Experiment Demonstrated in January 1954</article-title>
          . Vol.
          <volume>3265</volume>
          .
          <year>2004</year>
          .
          pp.
          <fpage>102</fpage>
          -
          <lpage>114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Mikheev</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Feature lattices for maximum entropy modelling</article-title>
          ,
          <source>in Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume</source>
          <volume>2</volume>
          .
          <year>1998</year>
          , Association for Computational Linguistics: Montreal, Quebec, Canada. pp.
          <fpage>848</fpage>
          -
          <lpage>854</lpage>
          . DOI: 10.3115/980691.980709.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>A.v.d.</given-names>
            <surname>Bosch</surname>
          </string-name>
          :
          <source>Memory-Based Language Processing</source>
          .
          <year>2009</year>
          : Cambridge University Press.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Cardie</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Using decision trees to improve case-based learning</article-title>
          ,
          <source>in Proceedings of the Tenth International Conference on International Conference on Machine Learning</source>
          .
          <year>1993</year>
          , Morgan Kaufmann Publishers Inc.: Amherst, MA, USA. pp.
          <fpage>25</fpage>
          -
          <lpage>32</lpage>
          . DOI: 10.1016/b978-1-55860-307-3.50010-1.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Convolutional Neural Networks for Sentence Classification</article-title>
          . arXiv e-prints,
          <year>2014</year>
          . DOI: 10.3115/v1/D14-1181.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>