<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Speech Analytics Architecture for Banking Contact Centers</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Plekhanov Russian University of Economics</institution>
          ,
          <addr-line>36 Stremyanny lane, Moscow, 115998</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>According to the Central Bank of the Russian Federation, more than 400 organizations provide banking services in Russia [1]. One of the key features of a mature and saturated market is competition between participants on either price or service quality [1]. Despite the spread of text-based communication channels, the call center remains one of the key channels for providing client services and promoting products [2, 3]. The development of digital technologies based on machine learning algorithms, such as speech recognition, sentiment analysis and semantic analysis, opens new opportunities for financial organizations to achieve an outstanding level of call center service by scrutinizing internal processes and inferring the best practices of top-performing employees. This paper addresses the application of machine learning technologies in contact centers, with a focus on banking organizations. The study concentrates on two business cases: correspondence of the operator's speech to a call script, and the increase of product sales. The article presents a conceptual framework and an architecture for a prospective software application based on convolutional neural networks (CNN) and recurrent neural networks (RNN), designed to automate the analysis of banking call center operations.</p>
      </abstract>
      <kwd-group>
        <kwd>banking</kwd>
        <kwd>contact center</kwd>
        <kwd>machine learning</kwd>
        <kwd>speech recognition</kwd>
        <kwd>text mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        According to the Central Bank of the Russian Federation, the number of organizations
providing banking services in Russia exceeds 400 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Despite a significant
decrease in the number of banks in the country over the last 10 years, the market is still
highly competitive and comparable to countries such as Austria, France and Italy
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. One of the key features of a mature and saturated market is competition between
participants on either price or service quality [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Despite the spread of text-based
communication channels, the call center remains one of the key channels for providing
client services and promoting products [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. The development of digital
technologies based on machine learning algorithms, such as speech recognition, sentiment
analysis and semantic analysis, reveals new opportunities for companies transforming
their call centers. Besides the obvious application of substituting the manual work of
supervisors who control the quality of operators' performance, machine learning
technologies can be employed as an effective and efficient instrument to improve the
quality of the services offered by call centers by scrutinizing internal processes and
inferring the best practices of top-performing
employees.
      </p>
      <p>This paper addresses the application of machine learning in contact center operations,
with a focus on banking organizations. The study concentrates on two business cases:</p>
      <p>Business Case 1 (BC1): Correspondence of the operator's speech to a call script.</p>
      <p>Business Case 2 (BC2): Increase of product sales via contact center.</p>
      <p>The structure of the rest of this paper is as follows. Section 2 presents the results of
the literature review and the key theoretical foundations. Section 3 explains the
approach to transforming the business problems into formal machine learning problems.
Section 4 presents a theoretical framework for a prospective software application.
Section 5 designs the conceptual architecture of the application. Section 6 summarizes
the findings and proposals and outlines further directions of the research.</p>
    </sec>
    <sec id="sec-2">
      <title>Machine learning foundations</title>
      <p>
        The fundamentals of machine learning are based on the seminal paper by Rosenblatt
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], in which he proposed a device named the perceptron, a model of a neuron that can be
taught to recognize images. Rosenblatt's idea was implemented in 1960 on an IBM
704 computer. Further development of neural network theory brought the concept of
connectionism [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], the concept of distributed representations [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and the backpropagation
algorithm [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Modern methods of speech recognition are based mostly on one of the following
algorithms:</p>
      <p>
        • Hidden Markov Models (HMM) combined with Gaussian Mixture Models (GMM) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ];
• Bayesian discrimination [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ];
• Dynamic Time Warping (DTW) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ];
• Recurrent Neural Networks (RNN) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ];
• Restricted Boltzmann Machines (RBM) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>
        HMM are based on the concept of a Markov chain, representing the interconnection of
a set of variables or states and the probabilities of transition between them [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The
model also includes hidden states that are not observed directly. For instance, in
the task of recognition, parts of speech might be determined only from the context
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>The key assumptions underlying the model are:</p>
      <p>• Prediction of the future state is independent of past observations;
• The probability of the predicted observation depends only on the state that
predetermined it.</p>
      <p>Thus, only the current state is analyzed in order to predict the next transition.</p>
      <p>
        Based on the HMM, three formal problems were defined: likelihood, decoding and
learning. Tackling these three issues allows one to construct a predictive model for
real-time speech recognition [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
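      <p>The decoding problem can be illustrated with a toy Viterbi pass over a two-state model; the states, probabilities and observations below are illustrative stand-ins, not values from any real acoustic model:</p>

```python
# Toy illustration of the HMM "decoding" problem: the Viterbi algorithm
# finds the most likely hidden-state sequence for a sequence of
# observations. All states and probabilities here are made up.
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = (best path probability ending in state s at time t, path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, path = max(
                (V[t - 1][prev][0] * trans_p[prev][s] * emit_p[s][obs[t]],
                 V[t - 1][prev][1])
                for prev in states)
            V[t][s] = (prob, path + [s])
    return max(V[-1].values())

states = ("vowel", "consonant")
start = {"vowel": 0.5, "consonant": 0.5}
trans = {"vowel": {"vowel": 0.3, "consonant": 0.7},
         "consonant": {"vowel": 0.6, "consonant": 0.4}}
emit = {"vowel": {"low": 0.7, "high": 0.3},
        "consonant": {"low": 0.2, "high": 0.8}}

# path is the most probable hidden-state sequence for the observations
prob, path = viterbi(["low", "high", "low"], states, start, trans, emit)
```

      <p>The likelihood problem is solved analogously with the forward algorithm, and learning with Baum-Welch re-estimation.</p>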
      <p>
        GMM refers to the class of probabilistic models assuming that all observations
within a dataset are produced by Gaussian distributions with unknown parameters.
The formal problem is stated as the estimation of the distribution parameters, given a
set of observations, by applying the expectation-maximization algorithm [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
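      <p>The expectation-maximization procedure can be sketched for a one-dimensional, two-component mixture; the data and starting parameters are illustrative:</p>

```python
import math

# EM for a one-dimensional, two-component Gaussian mixture: alternate the
# E-step (posterior responsibility of each component for each point) and
# the M-step (re-estimate weights, means and variances).
def em_gmm(data, mu, iters=50):
    w = [0.5, 0.5]          # mixture weights
    var = [1.0, 1.0]        # component variances
    for _ in range(iters):
        # E-step: responsibilities
        resp = []
        for x in data:
            dens = [w[k] / math.sqrt(2 * math.pi * var[k])
                    * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                    for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate parameters from the responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, data)) / nk, 1e-6)
    return w, mu, var

# Two well-separated clusters around 0 and 10; EM recovers their means.
data = [-0.5, 0.0, 0.4, 9.6, 10.0, 10.3]
w, mu, var = em_gmm(data, mu=[1.0, 8.0])
```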
      <p>
        Researchers have proposed various combinations of these models and shown them to
be more effective than single-model algorithms. For example, GMM might be
implemented to derive the observation probabilities of certain states in HMM, or a decision
tree might be designed to maximize the likelihood value [
        <xref ref-type="bibr" rid="ref20 ref21 ref22 ref23">20-23</xref>
        ].
      </p>
      <p>
        Unlike HMM, the Bayesian approach presumes the introduction of all model variables
during model design, and the posterior distribution of the variables is derived using Bayes'
rule [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Thus, the formal problem of the approach is distribution estimation.
      </p>
      <p>
        Compared to HMM and GMM, the Bayesian approach provides the
following advantages:
• The predicted observation depends on a set of prior states;
• Marginalization of model parameters yields improved classification;
• Model selection is performed by maximizing the probability of the posterior
distribution of model components [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
      <p>The approach requires considerable computation, which restricts its areas of
implementation.</p>
      <p>
        DTW refers to a class of algorithms applied to time series. The prerequisite for its
application is the possibility to shrink or stretch one of the time series along the time axis.
The "warping" procedure proportionally aligns the two time series with each other
and serves as a normalization phase. At the next stage the distance between the time
series patterns is calculated, providing a similarity measure [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Although the
technique proved itself well as a method to authenticate a speaker by voice, its application
in the speech recognition domain is limited to voice authentication [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
      </p>
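      <p>A minimal sketch of the DTW distance computation; the integer series below stand in for real acoustic feature sequences:</p>

```python
# Dynamic Time Warping distance: fill a cumulative-cost matrix in which
# each cell holds the cheapest alignment of the two prefixes, allowing
# either series to stretch or shrink along the time axis.
def dtw_distance(a, b):
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # stretch b
                                 D[i][j - 1],      # stretch a
                                 D[i - 1][j - 1])  # match step
    return D[n][m]

# The second series is a time-stretched copy of the first, so the warped
# distance is zero even though the lengths differ.
d = dtw_distance([1, 2, 3, 2], [1, 1, 2, 2, 3, 3, 2])
```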
      <p>
        Neural networks are represented in the list of algorithms by RNN and RBM. An RNN
is a neural network with input, hidden and output layers in which hidden-layer outputs
are fed back as inputs. Apart from learning during the training procedure, such a design
facilitates learning during network utilization [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ].
      </p>
      <p>
        RBM are constructed of nodes of two types, visible and hidden. Nodes between the
layers are interconnected by a fully bipartite graph. The resulting model is stochastic and
generative [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ].
      </p>
      <p>
        Over the last decade, the implementation of RBM for modeling input data yielded a
significant improvement in recognition rate and motivated academic researchers and
industry experts to study the application of deep learning to speech analysis. RNN
models extended the research field and outperformed RBM networks, reaching a
recognition error rate of 17.7 percent [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        The next academic field relevant to this research is Natural Language Processing
(NLP). NLP is a cross-disciplinary field involving linguistics, machine learning and
psychology [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. The first practical demonstration of NLP refers to the
Georgetown-IBM experiment, in which a predecessor of contemporary machine
translation systems such as Google Translate was demonstrated. Within the experiment,
60 sentences were translated from Russian
to English [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ].
      </p>
      <p>Inspired by these implementations and successes, researchers have intensively
promoted new approaches to NLP tasks, pushing the whole field forward. The
state-of-the-art techniques applied to NLP problems include maximum entropy learning
(MEL), memory-based learning (MBL), decision trees (DT) and convolutional neural
networks (CNN).</p>
      <p>MEL is a log-linear conditional probability model. The formal problem underlying
the technique is the selection of the model with maximum entropy from the set of models
satisfying the training dataset. The selection process is organized iteratively and
therefore requires substantial computation [36].</p>
      <p>MBL represents a straightforward machine learning algorithm based on a simple
approximation approach. Every piece of data is stored in a database, and predictions are
made based on the similarity of the input dataset to the stored ones, characterized by
a distance metric [37].</p>
      <p>Another technique for NLP classification problems is DT. It can be depicted as a
hierarchical structure with every node representing a decision and every leaf
representing a predetermined output class. DT is one of the most efficient machine learning
algorithms for the NLP problem of part-of-speech tagging [38].</p>
      <p>CNN is considered a powerful deep learning algorithm designed as a multilayer
network in which each neuron is connected to a local region of the next layer and
weights are shared across positions. This algorithm proved itself in the fields of image
recognition and text mining [39]. However, its effectiveness comes at the price of
significant computational cost [40].</p>
      <p>The combination of the mentioned research areas provides the theoretical basis for
developing a model applicable to the defined business cases of improving banking
contact center efficiency.</p>
    </sec>
    <sec id="sec-3">
      <title>Conceptual framework setup</title>
      <p>In order to transform the baseline business cases into formal research problems, I
define the key operations in the business process flow. The first case refers to the
problem of quality control and is defined in this paper as the correspondence of the
operator's speech to a call script. The operator's interaction with a client is recorded and
stored on a dedicated server as a phonogram. The phonogram is then transcribed using a
speech-to-text algorithm. The next step includes analysis using text mining algorithms
in order to calculate metrics of the correspondence of the communication flow to the
predefined script. The described case is depicted as a logical diagram in Figure 1.</p>
      <p>Speech recognition relates to the classification problem of the machine learning
domain. It can be defined as the task of designing an algorithm that allocates an
arbitrary object to one of the predefined groups with a certain probability [41].
Assume I = (i1, i2, …, in) denotes the set of fragments received after recording an audio
stream and O = (o1, o2, …, om) is the set of phonemes or words we expect to obtain. It is
necessary to build a function f which computes the most probable sequence of
phonemes or words O corresponding to the given set of audio fragments I:
O = f(I) = argmax P(O | I) (1)</p>
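      <p>At the level of a single audio fragment, such a classifier reduces to an argmax over candidate labels; the scalar "fragments" and the prototype table below are illustrative stand-ins for real acoustic feature vectors and a trained acoustic model:</p>

```python
# Sketch of the function f in (1): every audio fragment is scored against
# each candidate phoneme and the most probable label is chosen. The
# prototype values are hypothetical, purely for illustration.
prototypes = {"a": 0.9, "s": 0.2, "t": 0.5}   # hypothetical phoneme prototypes

def score(fragment, phoneme):
    # Higher score = fragment closer to the phoneme's prototype
    return -abs(fragment - prototypes[phoneme])

def classify(fragments, phonemes):
    # f(I): for each fragment pick the phoneme with the highest score
    return [max(phonemes, key=lambda p: score(f, p)) for f in fragments]

labels = classify([0.85, 0.15, 0.55], list(prototypes))
```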
      <p>At the next step, text data analysis is performed to extract quality metrics of the
operator's interaction. For instance, one of the text clustering methods may be applied
to calculate the correspondence of the recognized text to a call script. Additionally,
classification of certain text fragments may be used to discover the presence of welcome
and farewell expressions in the operator's speech [42]. The number of such metrics
depends on the specific business model of quality evaluation.</p>
      <p>The second business case extends the first one. The accumulated text data and the
metrics describing the correspondence of an interaction to a call script are enriched with
sales data. At the next stage, the resulting dataset is processed using regression
analysis to infer which characteristics of the operator's speech or parts of call scripts
influence sales rates. Figure 3 depicts the defined process.</p>
      <p>With the text metrics and the additional sales data it is possible to perform classical
regression analysis to explore the patterns that most influence sales via the contact center.
As a baseline, a simple linear regression model can be designed with successful sales
as the dependent variable and a set of speech metrics as independent variables. A
nonlinear regression may be constructed in case the resulting model does not fit the
dataset well [43].</p>
    </sec>
    <sec id="sec-4">
      <title>Proposed approaches</title>
      <sec id="sec-4-1">
        <title>Speech recognition</title>
        <p>
          Following the conceptual framework constructed in the previous section, the first
operation to be performed is the classification of audio fragments. A state-of-the-art
approach yielding excellent results for this task is RNN modeling. Graves et al.
achieved a 17.7 percent phoneme error rate on the
TIMIT speech dataset [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. The researchers designed a deep bidirectional RNN with
Long Short-Term Memory (LSTM) cells used for the hidden layers.
        </p>
        <p>The training process for the LSTM RNN includes several methods: connectionist
temporal classification (CTC), the RNN transducer, decoding and regularization. The
approach provides simultaneous training of the network to classify the input acoustic
information and to seek the most appropriate following phonemes, thus
constituting a joint acoustic and language model.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Text mining</title>
        <p>The text mining process consists of standard operations over the initial dataset,
described below.</p>
        <p>Preprocessing includes the basic procedures of tokenization, stemming and stop-word
removal. Initially, the text stream produced by the speech recognition algorithm is
broken up into a set of separate elements, for instance words, abbreviations and
interjections. Then the tokenized set of textual data is reduced to word stems, the base or
root forms of the inflected or derived forms. This process represents the stemming
procedure. Lastly, the dataset is cleared of stop words using a stop-word list.</p>
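        <p>The three preprocessing steps can be sketched as follows; the stop-word list and the suffix-stripping "stemmer" are simplified stand-ins for production components:</p>

```python
import re

# Minimal preprocessing pipeline: tokenization, stop-word removal and a
# naive suffix-stripping "stemmer". The stop-word list and suffix table
# are illustrative; production systems use full stemmers such as Porter's.
STOP_WORDS = {"the", "a", "an", "is", "to", "of"}
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z']+", text.lower())         # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop words
    return [stem(t) for t in tokens]                      # stemming

tokens = preprocess("Thank you for calling the bank")
```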
        <p>For the purpose of calculating the quality of the operator's interaction, two metrics
are proposed: correspondence of the speech to a call script, and the presence of greeting,
farewell and gratitude expressions. Additionally, for the purpose of further analysis,
automatic detection of the interaction topic is needed; therefore keyword detection is
also included in the text mining process.</p>
        <p>In order to evaluate the correspondence of speech to a call script, a text matching
model is proposed. Recent experiments by Pang et al. showed that their model, called
MatchPyramid, reaches an accuracy of up to 75.94 percent in comparing texts on the
MSRP dataset [44]. The research group applied image recognition algorithms, designing
a CNN over a matching matrix constituted of indicators representing the similarity
between words of the compared texts. Optimal results for the model were achieved when
processing fragments of 30-50 words. Considering that the interaction between a banking
contact center operator and a client is a dialog, such fragments fit well the length of
each side's speech.</p>
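        <p>The matching matrix at the core of this approach can be sketched with exact-match indicators; the published MatchPyramid model uses embedding-based word similarities instead, and the sentences below are illustrative:</p>

```python
# Sketch of the matching matrix fed to MatchPyramid's CNN: one indicator
# entry per word pair of the two compared texts (1.0 on an exact match,
# 0.0 otherwise).
def matching_matrix(text_a, text_b):
    a, b = text_a.split(), text_b.split()
    return [[1.0 if wa == wb else 0.0 for wb in b] for wa in a]

# Script fragment vs. operator utterance (illustrative)
M = matching_matrix("please confirm your card number",
                    "confirm card number please")
```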
        <p>The second metric can be estimated with the MatchPyramid model as well. It is
necessary to define an expressions dictionary as a baseline for comparison.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Data analysis</title>
        <p>The final operation is building a regression model on an enriched dataset consisting of
information about the operator and the call, such as job-shift time, client's gender,
client's age and others, defined as c1, c2, …, ci, the speech metrics, denoted m1 and m2,
and the sales result, denoted sj. Putting sj as the response variable and ci together with mj
as explanatory variables yields the following relation:</p>
        <p>sj = f(c1, c2, …, ci, m1, m2) (2)</p>
        <p>Assuming a simple linear relation between the variables, the equation can be
transformed into the following model:
sj = β0 + β1c1 + … + βici + βi+1m1 + βi+2m2 + ε (3)</p>
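        <p>A least-squares fit of such a linear model can be sketched with the normal equations; the dataset, with one call attribute and one speech metric, is synthetic:</p>

```python
# Ordinary least squares for a model like (3): sales result s explained
# by a call attribute c1 and a speech metric m1 plus an intercept.
# Coefficients solve the normal equations (X'X) beta = X'y via
# Gauss-Jordan elimination; the data are synthetic.
def ols(X, y):
    rows = [[1.0] + list(x) for x in X]        # prepend intercept column
    k = len(rows[0])
    # Augmented normal-equation matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    for i in range(k):                          # Gauss-Jordan elimination
        pivot = A[i][i]
        A[i] = [v / pivot for v in A[i]]
        for j in range(k):
            if j != i:
                factor = A[j][i]
                A[j] = [vj - factor * vi for vj, vi in zip(A[j], A[i])]
    return [A[i][k] for i in range(k)]          # [beta0, beta1, beta2]

# s = 1 + 2*c1 + 3*m1 exactly, so OLS recovers these coefficients
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * c + 3 * m for c, m in X]
beta = ols(X, y)
```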
        <p>The rule of thumb for linear regression sample size suggests that the number of
observations should be 20 times the number of independent variables. Therefore,
100-200 calls per day are needed to perform daily analysis; alternatively, the analysis
may be performed on a weekly or monthly basis.</p>
        <p>In case the linear regression model provides poor results, it can be replaced with a
non-linear regression model.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <sec id="sec-5-1">
        <title>Conceptual architecture design</title>
        <p>The previous section shows that two of the three key modules of the prospective
speech analytics application can be designed employing RNN and CNN. The
application can be built on a 3-layer architecture: a data access web server ensuring the
security of data transfer, an analysis server with the deployed speech recognition and text
mining models, and a database server storing recognized audio records, algorithm
settings and application logs. All interactions with corporate information systems, such as
the corporate data warehouse and the audio records storage, are performed via the data
access web server. An overview of the landscape is presented in Figure 4.</p>
        <p>At the application level, the architecture of the solution is constructed of a set of
components for speech recognition and text mining operations, controllers for training
and modelling, a regression analysis component and a representation layer. Although
two different networks are used for the tasks of speech recognition and text analysis,
the processes are sequential; therefore, it is possible to utilize the same hardware.</p>
        <p>The CRISP-DM standard suggests an evaluation stage to control modelling results [45].
To incorporate this practice, the application includes components for model update and
retraining.</p>
        <p>The architecture is displayed in Figure 5.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Discussion</title>
        <p>More than two thirds of requests for banking services are still made via telephone.
Moreover, telemarketing is still one of the key distribution channels for financial products.
In both cases the quality of communication depends on the performance of the operators
in the contact center.</p>
        <p>While classical instruments for quality control, such as supervision and selective
checks, allow only a limited number of interactions to be reviewed, automated control
covers all communications with clients, bringing a holistic view of call center
operations. Moreover, selective checks become redundant, and the number of supervisors
and managers may be reduced, providing a drop in staff costs. Apart from these
outcomes, the data derived from the phonograms, integrated within the corporate data
warehouse with information from other data sources, such as CRM systems, social
networks, and external and internal scoring systems, can produce additional value by
expanding the customer profile. Enriching customer data improves the predictive power
of recommendation models such as next best offer and yields improved conversion rates
and sales figures.</p>
        <p>Moreover, the designed architecture can easily be integrated with other
communication channels, for instance chat platforms. Chats deployed on websites or
implemented in mobile applications provide ready-for-analysis data that can be processed
using text mining algorithms.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and further directions</title>
      <p>This paper introduces a conceptual model and an architecture for a software application
based on CNN and RNN networks, designed to automate the analysis of call center
operations. BC1 and BC2, defined in the introduction, are of value to the majority of
banking organizations.</p>
      <p>The next step in the research is developing a prototype of the application and
performing tests to evaluate the recognition rate and the quality of text analysis in
one of the top 10 banks in the Russian Federation.</p>
      <p>Further research directions may also include the definition of new business cases that
can be solved using machine learning technologies, or the enhancement of the discussed
cases by applying new algorithms.</p>
      <p>36. Neumann, S., N. Ahituv, and M. Zviran: A measure for determining the strategic relevance
of IS to the organization. Information &amp; Management, 1992. 22(5): pp. 281-299.
37. Lin, J.-H. and J.S. Vitter: A Theory for Memory-Based Learning. Machine Learning,
1994. 17(2): pp. 143-167. DOI: 10.1023/A:1022667616941.
38. Quinlan, J.R.: Induction of decision trees. Machine Learning, 1986. 1(1): pp. 81-106. DOI:
10.1007/BF00116251.
39. Gu, J., et al.: Recent advances in convolutional neural networks. Pattern Recognition,
2018. 77: pp. 354-377. DOI: 10.1016/j.patcog.2017.10.013.
40. He, K. and J. Sun: Convolutional neural networks at constrained time cost. In 2015 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR). 2015. DOI:
10.1109/CVPR.2015.7299173.
41. Goodfellow, I., Y. Bengio, and A. Courville: Deep Learning. 2016: The MIT Press.
42. Hotho, A.: A Brief Survey of Text Mining. GLDV-Journal for Computational Linguistics
and Language Technology, 2005. 20(1): pp. 19-62.
43. Draper, N.R.: Applied Regression Analysis. Wiley Series in Probability and Statistics,
1998. DOI: 10.1002/9781118625590.
44. Pang, L., et al.: Text matching as image recognition. In Proceedings of the Thirtieth AAAI
Conference on Artificial Intelligence. 2016, AAAI Press: Phoenix, Arizona. pp. 2793-2799.
45. Chapman, P., et al.: CRISP-DM 1.0: Step-by-Step Data Mining Guide. The CRISP-DM
Consortium, 2000.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , et al.:
          <article-title>The Evolution of Strategic Simplicity: Exploring Two Models of Organizational Adaption</article-title>
          .
          <source>Journal of Management</source>
          ,
          <year>1996</year>
          .
          <volume>22</volume>
          (
          <issue>6</issue>
          ): pp.
          <fpage>863</fpage>
          -
          <lpage>887</lpage>
          DOI: 10.1177/014920639602200604.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. J.D. Power:
          <article-title>Ten Years After Great Recession, Innovation Overcomes Reputation as Bank Switching Hits Record Low</article-title>
          .
          <year>2019</year>
          , J.D. Power
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Deloitte: Global Contact Center Survey.
          <year>2019</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>4. Information on the Banking System of the Russian Federation</article-title>
          . https://www.cbr.ru/eng/statistics/pdko/lic/,
          <source>last accessed</source>
          <year>2020</year>
          /02/01.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Facts and Figures Banking in Europe 2019. https://www.ebf.eu/facts-andfigures/statistical-annex/,
          <source>last accessed</source>
          <year>2020</year>
          /02/01.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Rosenblatt</surname>
            ,
            <given-names>F.F.</given-names>
          </string-name>
          :
          <article-title>The perceptron: a probabilistic model for information storage and organization in the brain</article-title>
          .
          <source>Psychological review</source>
          ,
          <year>1958</year>
          . 65(6): pp.
          <fpage>386</fpage>
          -
          <lpage>408</lpage>
          DOI: 10.1037/h0042519.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Rumelhart</surname>
            ,
            <given-names>D.E.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.L.</given-names>
            <surname>McClelland</surname>
          </string-name>
          , and the PDP Research Group: Parallel Distributed Processing:
          <article-title>Explorations in the Microstructure of Cognition</article-title>
          .
          <source>1986</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>McClelland</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rumelhart</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          :
          <article-title>The appeal of parallel distributed processing</article-title>
          .
          <source>Computation &amp; intelligence</source>
          , 1995: pp.
          <fpage>305</fpage>
          -
          <lpage>341</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McClelland</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Rumelhart</surname>
          </string-name>
          :
          <article-title>Distributed representations</article-title>
          .
          <source>Parallel Distributed Processing: Explorations in the Microstructure of Cognition</source>
          . Vol.
          <volume>1</volume>
          .
          <year>1986</year>
          . pp.
          <fpage>77</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Rumelhart</surname>
            ,
            <given-names>D.E.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>G.E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.J.</given-names>
            <surname>Williams</surname>
          </string-name>
          :
          <article-title>Learning representations by backpropagating errors</article-title>
          .
          <source>Nature</source>
          ,
          <year>1986</year>
          .
          <volume>323</volume>
          (
          <issue>6088</issue>
          ): pp.
          <fpage>533</fpage>
          -
          <lpage>536</lpage>
          DOI: 10.1038/323533a0.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Bahl</surname>
            ,
            <given-names>L.R.</given-names>
          </string-name>
          , et al.:
          <article-title>Speech recognition with continuous-parameter hidden Markov models</article-title>
          .
          <source>ICASSP-88, International Conference on Acoustics, Speech, and Signal Processing</source>
          .
          <year>1988</year>
          . DOI: 10.1109/ICASSP.1988.196504.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Norris</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>J.M.</given-names>
            <surname>McQueen</surname>
          </string-name>
          :
          <article-title>Shortlist B: A Bayesian model of continuous speech recognition</article-title>
          .
          <source>Psychological Review</source>
          ,
          <year>2008</year>
          .
          <volume>115</volume>
          (
          <issue>2</issue>
          ): pp.
          <fpage>357</fpage>
          -
          <lpage>395</lpage>
          . DOI: 10.1037/0033-295X.115.2.357.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Muda</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Begam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Elamvazuthi</surname>
          </string-name>
          :
          <article-title>Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques</article-title>
          .
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.-r.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          :
          <article-title>Speech Recognition with Deep Recurrent Neural Networks</article-title>
          .
          <source>In IEEE International Conference on Acoustics, Speech and Signal Processing</source>
          .
          <year>2013</year>
          . DOI: 10.1109/ICASSP.2013.6638947.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Dahl</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          :
          <article-title>Phone recognition with the mean-covariance restricted Boltzmann machine</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          ,
          <year>2010</year>
          .
          <volume>23</volume>
          : pp.
          <fpage>469</fpage>
          -
          <lpage>477</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Aggoun</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.B.</given-names>
            <surname>Moore</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.J.</given-names>
            <surname>Elliott</surname>
          </string-name>
          :
          <article-title>Hidden Markov models: estimation and control</article-title>
          .
          <source>Stochastic Modelling and Applied Probability</source>
          .
          <year>1995</year>
          , Dordrecht: Springer.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , et al.:
          <article-title>Large vocabulary word recognition using context-dependent allophonic hidden Markov models</article-title>
          .
          <source>Computer Speech &amp; Language</source>
          ,
          <year>1990</year>
          .
          <volume>4</volume>
          (
          <issue>4</issue>
          ): pp.
          <fpage>345</fpage>
          -
          <lpage>357</lpage>
          . DOI: 10.1016/0885-2308(90)90015-X.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>J.H.</given-names>
            <surname>Martin</surname>
          </string-name>
          :
          <article-title>Speech and language processing</article-title>
          . Vol.
          <volume>3</volume>
          .
          <year>2014</year>
          : Pearson, London.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Reynolds</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Gaussian Mixture Models</article-title>
          , in
          <source>Encyclopedia of Biometrics</source>
          ,
          <string-name>
            <given-names>S.Z.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.K.</given-names>
            <surname>Jain</surname>
          </string-name>
          , Editors.
          <year>2015</year>
          ,
          Springer US
          : Boston, MA. pp.
          <fpage>827</fpage>
          -
          <lpage>832</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Akamine</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ajmera</surname>
          </string-name>
          :
          <article-title>Decision tree-based acoustic models for speech recognition</article-title>
          .
          <source>EURASIP Journal on Audio, Speech, and Music Processing</source>
          ,
          <year>2012</year>
          .
          <volume>2012</volume>
          (
          <issue>1</issue>
          ):
          <fpage>10</fpage>
          . DOI: 10.1186/1687-4722-2012-10.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Ju</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , et al.:
          <article-title>Dynamic Grasp Recognition Using Time Clustering, Gaussian Mixture Models and Hidden Markov Models</article-title>
          .
          <source>Advanced Robotics</source>
          ,
          <year>2009</year>
          .
          <volume>23</volume>
          (
          <issue>10</issue>
          ): pp.
          <fpage>1359</fpage>
          -
          <lpage>1371</lpage>
          . DOI: 10.1163/156855309X462628.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Pujol</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , et al.:
          <article-title>Comparison and combination of features in a hybrid HMM/MLP and a HMM/GMM speech recognition system</article-title>
          .
          <source>IEEE Transactions on Speech and Audio Processing</source>
          ,
          <year>2005</year>
          .
          <volume>13</volume>
          (
          <issue>1</issue>
          ): pp.
          <fpage>14</fpage>
          -
          <lpage>22</lpage>
          . DOI: 10.1109/TSA.2004.834466.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Swietojanski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghoshal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Renals</surname>
          </string-name>
          :
          <article-title>Revisiting hybrid and GMM-HMM system combination techniques</article-title>
          .
          <source>In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing</source>
          .
          <year>2013</year>
          . DOI: 10.1109/ICASSP.2013.6638967.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Bayes</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Price</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>An Essay towards Solving a Problem in the Doctrine of Chances. By the Late Rev. Mr. Bayes, F. R. S. Communicated by Mr. Price, in a Letter to John Canton, A. M. F. R. S.</article-title>
          <source>Philosophical Transactions (1683-1775)</source>
          ,
          <year>1763</year>
          .
          <volume>53</volume>
          : pp.
          <fpage>370</fpage>
          -
          <lpage>418</lpage>
          . DOI: 10.1098/rstl.1763.0053.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Griffiths</surname>
            ,
            <given-names>T.L.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kemp</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.B.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          :
          <article-title>Bayesian models of cognition</article-title>
          , in
          <source>The Cambridge Handbook of Computational Psychology</source>
          .
          <year>2008</year>
          , Cambridge University Press: New York, NY, US. pp.
          <fpage>59</fpage>
          -
          <lpage>100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Vasimalla</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Narasimham</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Naik</surname>
          </string-name>
          :
          <article-title>Efficient Dynamic Time Warping for Time Series Classification</article-title>
          .
          <source>Indian Journal of Science and Technology</source>
          ,
          <year>2016</year>
          .
          <volume>9</volume>
          . DOI: 10.17485/ijst/2016/v9i21/93886.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Permanasari</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Harahap</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Prayoga</surname>
          </string-name>
          :
          <article-title>Speech recognition using Dynamic Time Warping (DTW)</article-title>
          .
          <source>Journal of Physics: Conference Series</source>
          ,
          <year>2019</year>
          .
          <volume>1366</volume>
          :
          <fpage>012091</fpage>
          . DOI: 10.1088/1742-6596/1366/1/012091.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Rodriguez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wiles</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.L.</given-names>
            <surname>Elman</surname>
          </string-name>
          :
          <article-title>A Recurrent Neural Network that Learns to Count</article-title>
          .
          <source>Connection Science</source>
          ,
          <year>1999</year>
          .
          <volume>11</volume>
          (
          <issue>1</issue>
          ): pp.
          <fpage>5</fpage>
          -
          <lpage>40</lpage>
          . DOI: 10.1080/095400999116340.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Igel</surname>
          </string-name>
          :
          <article-title>Training restricted Boltzmann machines: An introduction</article-title>
          .
          <source>Pattern Recognition</source>
          ,
          <year>2014</year>
          .
          <volume>47</volume>
          (
          <issue>1</issue>
          ): pp.
          <fpage>25</fpage>
          -
          <lpage>39</lpage>
          . DOI: 10.1016/j.patcog.2013.05.025.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Klein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Loper</surname>
          </string-name>
          :
          <source>Natural Language Processing with Python</source>
          .
          <year>2009</year>
          : O'Reilly Media, Inc.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Hutchins</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>The Georgetown-IBM Experiment Demonstrated in January 1954</article-title>
          . Vol.
          <volume>3265</volume>
          .
          <year>2004</year>
          .
          pp.
          <fpage>102</fpage>
          -
          <lpage>114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Mikheev</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Feature lattices for maximum entropy modelling</article-title>
          ,
          <source>in Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume</source>
          <volume>2</volume>
          .
          <year>1998</year>
          , Association for Computational Linguistics: Montreal, Quebec, Canada. pp.
          <fpage>848</fpage>
          -
          <lpage>854</lpage>
          . DOI: 10.3115/980691.980709.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>A.v.d.</given-names>
            <surname>Bosch</surname>
          </string-name>
          :
          <source>Memory-Based Language Processing</source>
          .
          <year>2009</year>
          : Cambridge University Press.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Cardie</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Using decision trees to improve case-based learning</article-title>
          ,
          <source>in Proceedings of the Tenth International Conference on International Conference on Machine Learning</source>
          .
          <year>1993</year>
          , Morgan Kaufmann Publishers Inc.: Amherst, MA, USA. pp.
          <fpage>25</fpage>
          -
          <lpage>32</lpage>
          . DOI: 10.1016/b978-1-55860-307-3.50010-1.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Convolutional Neural Networks for Sentence Classification</article-title>
          . arXiv e-prints,
          <year>2014</year>
          . DOI: 10.3115/v1/D14-1181.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>