A Heterogeneous Conversational Recommender System for Financial Products

Mao Kang

kangmao028@pingan.com.cn 1

Ye Bi

biye645@pingan.com.cn 1

Zhenyu Wu

wuzhenyu447@pingan.com.cn 1

Jianming Wang

wangjianming888@pingan.com.cn 2

Jing Xiao

xiaojing661@pingan.com.cn 2

1 Ping An Technology (Shenzhen) Co., Ltd., Shanghai, China
2 Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China

KEYWORDS

Conversational Recommender System, Financial Products Recommendation, Heterogeneous Modelling, Deep Neural Networks

ACM Reference Format:

Mao Kang, Ye Bi, Zhenyu Wu, Jianming Wang, and Jing Xiao. 2020. A Heterogeneous Conversational Recommender System for Financial Products.

2019

Financial product recommendation differs from e-commerce and web recommendation: financial products have fewer available items, are more expensive, are purchased less frequently, and are subject to user-specific constraints. Research on financial product recommendation is quite limited, and current industry applications still focus on exploiting machine learning techniques. Behavioral finance theory states that financial decisions are affected by psychological behavior biases, which are generally identified through conversation with professional advisors. In addition, in a conversation customers actively express subjective requirements and interests that cannot be inferred from their static structured data. Inspired by this, we propose an innovative heterogeneous conversational recommender system (HConvoNet) that considers not only the customer's static profile but also implicit behavior biases and interests, and is thus adaptive to the customer. The proposed framework consists of two modules: a profile module and a conversation module. The profile module aims to capture the customer's important static needs, while the conversation module aims to extract behavior biases and dynamic interests. By integrating the two modules, HConvoNet can recommend financial products adaptively. Experiments are conducted on three internal datasets from Ping An Insurance, with the goal of predicting the customer's purchase intention. We compare our model with several baselines and observe a significant improvement.

CCS CONCEPTS

• Information systems → Recommender systems; • Computing methodologies → Information extraction;

1 INTRODUCTION

Recommender systems are extensively used in various areas. Most research focuses on collaborative filtering and content-based filtering. Collaborative filtering assumes that users who agreed in the past will like similar items in the future. Content-based filtering recommends items similar to those the user liked in the past. E-commerce companies such as Amazon, eBay and Alibaba use well-developed collaborative filtering algorithms to recommend products. Video and music services such as YouTube and Spotify use content-based filtering to recommend playlists.

In recent years, research has extended to recommending financial products and insurance (we refer to both as “financial products”). Financial product recommendation is quite different from the recommendation settings mentioned above. E-commerce companies usually have large amounts of data and frequent user actions, whereas financial products have fewer available items and are purchased infrequently. They are also usually more expensive and subject to user-specific constraints. A knowledge-based recommender system is a specific type of recommender system that uses a knowledge base and the user profile to make personalized recommendations. It is typically applied in domains where collaborative filtering and content-based filtering cannot be applied, such as financial product recommendation. Most current studies on this topic remain within the scope of constraint-based or case-based reasoning. In practice, building a knowledge base is complicated and costly, so practical applications still rely on exploiting customer profiles with machine learning techniques such as Random Forest and Generalized Linear Models, chosen for their simplicity, robustness and good explainability.

Recommending financial products requires a thorough understanding of the financial decision-making process. Behavioral finance [15] studies the psychology of this process. It states that market participants are not rational and are subject to multiple behavior biases, which in turn affect their decisions. Typical biases include overconfidence bias, herding bias and status quo bias. Overconfidence bias occurs when market participants overestimate their intuitive ability and underestimate risk. Herding occurs when individuals follow the crowd's decision. Status quo bias refers to the tendency to stay in the current state and the unwillingness to make changes.

Financial advisors usually identify behavior biases from the statements and questions of customers during a conversation, and then give suggestions that take these biases into account. In addition, we believe that more dynamic interests and subjective requirements can be observed in a conversation. Inspired by this, we propose a heterogeneous conversational recommender system (HConvoNet), which integrates unstructured conversation data with the structured profile to make more adaptive recommendations. In brief, our proposed framework consists of two modules: a customer profile module and a conversation module. The profile module aims to capture the customer's important static needs, while the conversation module aims to extract behavior biases and dynamic interests. This is feasible because most companies have stored large amounts of conversation data from routine business such as telemarketing.

We model the structured profile data with a deep model, adopting the DeepFM framework [5]. To capture the information embedded in a conversation comprehensively, we build the architecture using a two-level bidirectional Gated Recurrent Unit (GRU) with a self-attention mechanism. The lower level encodes each single utterance, and the upper level encodes the whole conversation, considering contextual interactions among utterances.

We conduct the experiments on three internal datasets from Ping An Insurance: ESB, Wuyou and Anxin, which are popular insurance products of Ping An Insurance. In a conversation between an insurance agent and a customer, the agent usually asks multiple questions in order to infer the customer's insurance needs and preferences. The objective is to predict the customer's purchase intention. The baseline models include methods popular in industry and some variants of HConvoNet. Results show that our proposed model achieves a significant improvement over the baselines.

To summarize, we make the following main contributions:

• We propose an innovative heterogeneous conversational recommender system (HConvoNet) for financial products, which adapts to customer behavior biases and dynamic interests.
• The proposed HConvoNet integrates structured customer profile data and unstructured conversation data and adopts cutting-edge NLP techniques.
• The proposed HConvoNet can be applied to most practical cases and has huge commercial value.

2 RELATED WORK

In this paper, we propose an innovative heterogeneous conversational recommender system (HConvoNet) for financial products. The most closely related domains are recommender systems and textual information extraction. In this section, we discuss the related work in both.

Recommender System. Factorization Machines (FM) [13] is a classical approach to modelling feature interactions using factorized parameters. Field-aware FM [9] is a variant of FM that adds the field index into the feature space. FNN [22], Wide & Deep [1] and DeepFM [5] are examples of using deep neural networks to learn more complex feature interactions. Deep learning techniques have also been applied to collaborative filtering and content-based recommendation, as in [17] and [21]. [6] exploits RNNs to develop a session-based recommender system. [19] uses an RNN to build a recommender system for movies. Google developed a two-stage deep learning framework for YouTube video recommendation [3]. [18] and [23] propose hybrid models, which use deep learning to learn features of various domains.

However, most of this research exploits the objective nature of items and users and ignores unstructured data, which is subjective to the user and affects the decision. Some work focuses on mining short text reviews to capture user sentiment, such as [12], but these approaches are not suitable for financial products.

Textual Information Extraction. Recurrent neural networks (RNNs) are a standard way to extract sequential information. [14] extends the RNN to a bidirectional RNN. [7] proposes the long short-term memory (LSTM) framework. The gated recurrent unit (GRU), proposed by [2], is seen as a better network for capturing long sequential relationships. All these networks have succeeded in many natural language processing tasks. [11] uses LSTM for sentiment analysis. [24] proposes to use biLSTM to extract relations. [4] uses biLSTM for speech recognition. [8] uses GRU for emotion recognition and [20] uses GRU for document classification.

3 PROPOSED FRAMEWORK

3.1 Problem Definition

The dataset contains unstructured conversation transcripts and structured profile data, $D = \{C_s, P_s, y_s\}_{s=1}^{N}$, where $N$ is the number of samples, and $C_s$, $P_s$ and $y_s$ represent the conversation transcript, structured profile data and label of sample $s$, respectively. Each conversation contains multiple utterances spoken by either the agent or the customer, $C = \{u_i\}_{i=1}^{n}$, where $u_i$ represents utterance $i$ and $n$ is the number of utterances in the conversation. Each utterance consists of multiple words, $u_i = \{w_{i,j}\}_{j=1}^{K_i}$, where $K_i$ is the number of words in utterance $i$. We aim to use this heterogeneous data to predict the customer's preference.
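As an illustration of the notation above, the following minimal sketch shows one way a sample $(C_s, P_s, y_s)$ could be held in memory. The names `Sample`, `Utterance` and `Conversation`, the numeric profile encoding and the 0/1 label encoding are assumptions made for this example only; they are not part of the original system.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical type names; the paper only defines the symbols D, C_s, P_s, y_s.
Utterance = List[str]           # u_i: a list of K_i word tokens
Conversation = List[Utterance]  # C_s: a list of n utterances


@dataclass
class Sample:
    conversation: Conversation  # C_s, the unstructured transcript
    profile: List[float]        # P_s, structured profile features (assumed numeric)
    label: int                  # y_s, assumed encoding: 1 = purchased, 0 = not


# A toy dataset D with a single sample (all values are made up).
dataset: List[Sample] = [
    Sample(
        conversation=[["hello", "sir"], ["what", "does", "it", "cover"]],
        profile=[35.0, 1.0, 0.0],
        label=1,
    ),
]

n = len(dataset[0].conversation)               # number of utterances in C_s
K = [len(u) for u in dataset[0].conversation]  # K_i for each utterance
```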

The overall architecture of our proposed framework is shown in Figure 1. The framework can be explained in three parts: the profile module, the conversation module and the fusion part. We describe each of them below.

3.2 Profile Module

The profile module takes the form of DeepFM [5], which has two parts: an FM part and a DNN part.

FM part. FM is good at handling sparse data and can model the first-order impact of, and the second-order interactions among, all features. According to [13] and [5], FM can be expressed as:

$$y_{FM} = \langle \omega, x \rangle + \sum_{j_1=1}^{d} \sum_{j_2=j_1+1}^{d} \langle V_{j_1}, V_{j_2} \rangle \, x_{j_1} \cdot x_{j_2} \tag{1}$$

where $\omega$ and $V_j$ are parameters to estimate, $\omega \in \mathbb{R}^{d}$, $V_j \in \mathbb{R}^{k}$ ($k$ is the given feature embedding size) and $\langle \cdot, \cdot \rangle$ is the dot product.
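To make the FM expression concrete, the sketch below evaluates it for a small dense feature vector in plain Python. The function names and the toy numbers are assumptions for illustration. The second function uses the linear-time reformulation of the pairwise term given in [13] and should return the same value as the direct double sum.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def fm_output(x: List[float], w: List[float], V: List[List[float]]) -> float:
    """Direct evaluation: <w, x> plus the sum over feature pairs j1 < j2."""
    d = len(x)
    first_order = dot(w, x)
    second_order = 0.0
    for j1 in range(d):
        for j2 in range(j1 + 1, d):
            second_order += dot(V[j1], V[j2]) * x[j1] * x[j2]
    return first_order + second_order


def fm_output_fast(x: List[float], w: List[float], V: List[List[float]]) -> float:
    """Same value in O(d*k): 0.5 * sum_f [(sum_j V_jf x_j)^2 - sum_j (V_jf x_j)^2]."""
    k = len(V[0])
    second_order = 0.0
    for f in range(k):
        s = sum(V[j][f] * x[j] for j in range(len(x)))
        sq = sum((V[j][f] * x[j]) ** 2 for j in range(len(x)))
        second_order += 0.5 * (s * s - sq)
    return dot(w, x) + second_order


# Toy example with d = 3 features and embedding size k = 2 (made-up values).
x = [1.0, 0.0, 2.0]
w = [0.1, 0.2, 0.3]
V = [[1.0, 0.0], [0.5, 0.5], [1.0, 2.0]]
y_fm = fm_output(x, w, V)
```

With sparse one-hot inputs, as is typical for profile fields, most products $x_{j_1} \cdot x_{j_2}$ vanish, which is why the factorized form is practical.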

[Figure 1: Overall architecture of HConvoNet. Profile module: profile embedding of fields 1 to m, feeding an FM part (1st-order and 2nd-order terms) and a DNN part, producing $y_{prof}$. Conversation module: word embeddings $e(w_{i,j})$, an utterance-level bidirectional GRU and a conversation-level bidirectional GRU, each followed by self-attention with max pooling, producing $y_{convo}$. Both outputs are fed to a fully connected layer with softmax.]

DNN part. The DNN part models more complex non-linear interactions among feature embeddings. The output of the embedding layer is fed into the deep neural network, which follows the forward process:

$$a^{(l+1)} = \sigma\left(W_f^{(l)} \cdot a^{(l)} + b_f^{(l)}\right) \tag{2}$$

where $\sigma$ is the activation function, $l$ is the layer depth, $W_f^{(l)}$ is the weight matrix and $b_f^{(l)}$ is the bias. We use ReLU as the activation function and take the output of the last layer, $a^{(L)}$, as the DNN part representation $y_{DNN}$.

The final representation of the customer profile is the concatenation of the FM part and the DNN part:

$$y_{prof} = [y_{FM}; y_{DNN}] \tag{3}$$
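The forward pass of the DNN part and the concatenation into $y_{prof}$ can be sketched as follows. This is a minimal illustration with hand-picked weights, assuming a scalar $y_{FM}$; the helper names are hypothetical and the layer sizes are not those used in the experiments.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def relu(v: List[float]) -> List[float]:
    return [max(0.0, x) for x in v]


def dense(W: Matrix, b: List[float], a: List[float]) -> List[float]:
    """Affine map W * a + b for one layer."""
    return [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
            for row, b_i in zip(W, b)]


def dnn_forward(a0: List[float], layers: List[Tuple[Matrix, List[float]]]) -> List[float]:
    """Apply a^(l+1) = ReLU(W^(l) a^(l) + b^(l)) layer by layer; return a^(L)."""
    a = a0
    for W, b in layers:
        a = relu(dense(W, b, a))
    return a


def profile_representation(y_fm: float, y_dnn: List[float]) -> List[float]:
    """y_prof = [y_FM; y_DNN], treating y_FM as a scalar (assumption)."""
    return [y_fm] + y_dnn


# Toy example: a 2-dimensional embedding passed through two small layers.
layers = [
    ([[1.0, -0.5], [-1.0, 1.0]], [0.0, 0.5]),  # 2 -> 2
    ([[0.5, 1.0]], [0.1]),                     # 2 -> 1
]
y_dnn = dnn_forward([1.0, -2.0], layers)
y_prof = profile_representation(2.7, y_dnn)
```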

3.3 Conversation Module

The conversation module takes advantage of cutting-edge natural language processing techniques. It can be seen as a two-level bidirectional GRU. The lower level encodes each single utterance, and the upper level encodes the whole conversation, considering contextual interactions among utterances. In addition, we propose using the self-attention mechanism [16] to focus on the more important information in both the utterance-level embedding and the conversation-level embedding.

Utterance level. Suppose a single utterance $u_i$ contains $K_i$ words, $u_i = \{w_{i,j}\}_{j=1}^{K_i}$. For each word $w_{i,j}$, we have:

$$\overrightarrow{h}_{i,j} = \mathrm{GRU}\left(e(w_{i,j}), \overrightarrow{h}_{i,j-1}\right) \tag{4}$$

$$\overleftarrow{h}_{i,j} = \mathrm{GRU}\left(e(w_{i,j}), \overleftarrow{h}_{i,j+1}\right) \tag{5}$$

where $e(w_{i,j})$ is the word embedding obtained from pre-trained word embeddings. The forward and backward hidden states are concatenated into $h_{i,j} = [\overrightarrow{h}_{i,j}; \overleftarrow{h}_{i,j}]$. Suppose the dimension of a unidirectional hidden state is $m$. Then $h_{i,j}$ has a dimension of $2m$.
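The bidirectional encoding above can be sketched in plain Python as follows. This is only an illustration: the weights are random rather than trained, the gate equations follow one common GRU formulation (update gate $z$, reset gate $r$, with $h = z \odot h_{prev} + (1 - z) \odot \tilde{h}$), and the class and function names are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W: List[Vector], v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows: int, cols: int, rng: random.Random) -> List[Vector]:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Untrained GRU cell with random weights, for shape illustration only."""

    def __init__(self, input_dim: int, hidden_dim: int, rng: random.Random):
        self.m = hidden_dim
        # One (W, U, b) triple per gate: update z, reset r, candidate c.
        self.params = {
            g: (rand_matrix(hidden_dim, input_dim, rng),
                rand_matrix(hidden_dim, hidden_dim, rng),
                [0.0] * hidden_dim)
            for g in "zrc"
        }

    def _affine(self, gate: str, x: Vector, h: Vector) -> Vector:
        W, U, b = self.params[gate]
        return [a + c + d for a, c, d in zip(matvec(W, x), matvec(U, h), b)]

    def step(self, x: Vector, h_prev: Vector) -> Vector:
        z = [sigmoid(v) for v in self._affine("z", x, h_prev)]
        r = [sigmoid(v) for v in self._affine("r", x, h_prev)]
        gated = [ri * hi for ri, hi in zip(r, h_prev)]
        c = [math.tanh(v) for v in self._affine("c", x, gated)]
        return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h_prev, c)]


def bigru(embeddings: List[Vector], fwd: GRUCell, bwd: GRUCell) -> List[Vector]:
    """Return one concatenated state [forward; backward] of size 2m per input."""
    h = [0.0] * fwd.m
    forward = []
    for x in embeddings:                           # left to right
        h = fwd.step(x, h)
        forward.append(h)
    h = [0.0] * bwd.m
    backward: List[Vector] = [[] for _ in embeddings]
    for j in range(len(embeddings) - 1, -1, -1):   # right to left
        h = bwd.step(embeddings[j], h)
        backward[j] = h
    return [f + b for f, b in zip(forward, backward)]


rng = random.Random(0)
fwd_cell, bwd_cell = GRUCell(3, 2, rng), GRUCell(3, 2, rng)
words = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.5], [0.3, 0.3, 0.3], [1.0, 0.0, -1.0]]
H_i = bigru(words, fwd_cell, bwd_cell)  # K_i = 4 rows, each of size 2m = 4
```

The same routine applies unchanged at the conversation level, with utterance embeddings taking the place of word embeddings.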

We apply the self-attention mechanism [16] to the concatenated hidden states to pay more attention to important words. We denote $H_i = (h_{i,1}; h_{i,2}; \ldots; h_{i,j}; \ldots; h_{i,K_i})$, where $H_i \in \mathbb{R}^{K_i \times 2m}$. The weight matrix of the self-attention mechanism is calculated as:

$$A_i = \mathrm{softmax}\left(\frac{H_i \cdot H_i^{T}}{\sqrt{2m}}\right) \tag{6}$$

where $A_i \in \mathbb{R}^{K_i \times K_i}$ and $\sqrt{2m}$ is a scale factor. The self-attended hidden states for the words are then computed as:

$$H_i^{sa} = A_i \cdot H_i \tag{7}$$

where $H_i^{sa}$ has the same shape as $H_i$, namely $K_i \times 2m$. The single utterance embedding is then obtained by max-pooling over all words' self-attended hidden states:

$$e(u_i) = \mathrm{maxpool}(H_i^{sa}) \tag{8}$$

where $e(u_i) \in \mathbb{R}^{2m}$.

Conversation level. We obtain the conversation embedding in a similar way to the utterance embedding. Suppose a conversation consists of $n$ utterances. We feed the utterance embeddings obtained from the previous step into another bidirectional GRU:

$$\overrightarrow{h}_{i} = \mathrm{GRU}\left(e(u_i), \overrightarrow{h}_{i-1}\right) \tag{9}$$

$$\overleftarrow{h}_{i} = \mathrm{GRU}\left(e(u_i), \overleftarrow{h}_{i+1}\right) \tag{10}$$

We concatenate the forward and backward hidden states, $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$, and represent all concatenated hidden states as an $n \times 2m$ matrix $H$.
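The pooling step used to turn a matrix of hidden states into a single embedding, scaled dot-product self-attention followed by max pooling, can be sketched as below. The sketch assumes the softmax is taken row-wise, which the text does not state explicitly, and the function names are chosen for this example only.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    top = max(row)                       # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_maxpool(H: Matrix) -> Tuple[Matrix, Matrix, List[float]]:
    """Return (A, H_sa, pooled) for a matrix H with one hidden state per row."""
    steps, dim = len(H), len(H[0])       # dim plays the role of 2m
    scale = math.sqrt(dim)
    # Scaled scores H * H^T / sqrt(2m), then a row-wise softmax (assumption).
    scores = [[sum(a * b for a, b in zip(hi, hj)) / scale for hj in H] for hi in H]
    A = [softmax(row) for row in scores]
    # Self-attended states H_sa = A * H, same shape as H.
    H_sa = [[sum(A[i][t] * H[t][c] for t in range(steps)) for c in range(dim)]
            for i in range(steps)]
    # Max pooling over the time axis gives one vector of size 2m.
    pooled = [max(row[c] for row in H_sa) for c in range(dim)]
    return A, H_sa, pooled


H_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 time steps, 2m = 2
A_toy, H_sa_toy, embedding = self_attention_maxpool(H_toy)
```

Because every row of the attention matrix sums to one, each self-attended state is a weighted average of the original states, so the pooled embedding stays within the range of the inputs.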

Again, utterances are not of equal importance. We apply the self-attention mechanism to learn their relative weights:

$$A = \mathrm{softmax}\left(\frac{H \cdot H^{T}}{\sqrt{2m}}\right) \tag{11}$$

where $\sqrt{2m}$ is a scale factor. The self-attended hidden state matrix for the utterances is then computed as:

$$H^{sa} = A \cdot H \tag{12}$$

The final conversation embedding is obtained by max-pooling over all utterances' self-attended hidden states:

$$y_{convo} = \mathrm{maxpool}(H^{sa}) \tag{13}$$

3.4 Making Prediction

To generate the prediction, we concatenate the outputs of the profile module and the conversation module and feed them into a fully connected layer followed by a softmax function:

$$y_{final} = [y_{prof}; y_{convo}] \tag{14}$$

$$\hat{y} = \mathrm{softmax}(W \cdot y_{final} + b) \tag{15}$$

The categorical cross-entropy is used as the loss function:

$$loss = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_{i,j} \log(\hat{y}_{i,j}) \tag{16}$$

where $y_{i,j}$ and $\hat{y}_{i,j}$ are the ground truth and the prediction for sample $i$ and class $j$.
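The fusion step and the loss can be illustrated with the short sketch below. The weights are placeholders, the small constant `eps` inside the logarithm is an addition for numerical safety that does not appear in the loss as written, and the function names are hypothetical.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict(y_prof: List[float], y_convo: List[float],
            W: List[List[float]], b: List[float]) -> List[float]:
    """y_hat = softmax(W * [y_prof; y_convo] + b)."""
    y_final = y_prof + y_convo          # list concatenation
    logits = [sum(w * v for w, v in zip(row, y_final)) + b_c
              for row, b_c in zip(W, b)]
    return softmax(logits)


def cross_entropy(Y_true: List[List[float]], Y_pred: List[List[float]],
                  eps: float = 1e-12) -> float:
    """Categorical cross-entropy summed over samples and classes."""
    return -sum(y * math.log(p + eps)
                for y_row, p_row in zip(Y_true, Y_pred)
                for y, p in zip(y_row, p_row))


# Toy example: 2 classes, zero weights, so the prediction is uniform.
W0 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b0 = [0.0, 0.0]
y_hat = predict([2.7, 1.1], [0.4], W0, b0)
loss = cross_entropy([[1.0, 0.0]], [y_hat])
```

With a uniform two-class prediction, the loss for a single sample equals $\log 2$, which gives a convenient sanity check for an untrained model.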

4 EXPERIMENTS

4.1 Dataset

We conduct our experiments on three internal datasets from Ping An Insurance: ESB, Wuyou and Anxin, which are three popular insurance products of Ping An Insurance. ESB is a medical insurance product. Wuyou is a universal insurance product with some investment features. Anxin is an accident insurance product. All three datasets contain unstructured conversation data and structured customer profile data. Labels are collected from the customer's purchase records within 15 days after the conversation. The objective is to predict the customer's purchase intention given the customer's profile and conversation data. The time window of our datasets is May 2019. Because the class distribution is unbalanced, we further downsample the datasets to a rough ratio of 1:5. Table 1 provides detailed information about each dataset. We randomly take 80% of the data as the training set and 20% as the test set. We further partition the training set into a development set and a validation set with an 80/20 ratio.

We follow the general feature engineering process to preprocess the structured data. We preprocess the conversation data with the following steps:

(1) We extract the textual transcripts of the conversation audio using Automatic Speech Recognition (ASR) and clean the data to remove noise introduced by this step;
(2) We segment each utterance into tokens with the jieba package, to which we add some business terminology;
(3) We remove all non-alphanumeric characters, stop words and words with a frequency lower than two;
(4) We use the publicly available 300-dimensional word2vec vectors trained on a large corpus across various domains. Words not in word2vec are randomly initialized.
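The downsampling and splitting protocol described above can be sketched as follows. The sketch assumes that the 1:5 ratio means one positive sample for every five negative samples and that negatives are the class being reduced; neither detail is stated explicitly. The function names, seeds and toy data are illustrative only.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[int, int]  # (sample id, label), label 1 = purchased (assumed)


def downsample(samples: Sequence[Example], ratio: int = 5, seed: int = 0) -> List[Example]:
    """Keep all positives and at most `ratio` negatives per positive."""
    rng = random.Random(seed)
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    keep = min(len(negatives), ratio * len(positives))
    balanced = positives + rng.sample(negatives, keep)
    rng.shuffle(balanced)
    return balanced


def split(items: Sequence[Example], frac: float = 0.8, seed: int = 0
          ) -> Tuple[List[Example], List[Example]]:
    """Random split into two parts holding `frac` and 1 - `frac` of the items."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * frac)
    return shuffled[:cut], shuffled[cut:]


# Toy data: 10 positives and 200 negatives.
raw = [(i, 1) for i in range(10)] + [(i, 0) for i in range(10, 210)]
balanced = downsample(raw)                 # 10 positives + 50 negatives
train, test = split(balanced)              # 80% / 20%
dev, val = split(train, seed=1)            # 80% / 20% of the training set
```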

Training. We adopt Adam [10] as the optimizer and set the initial learning rate to $2 \times 10^{-4}$. An annealing strategy is used that halves the learning rate every 20 epochs. For regularization, we apply dropout with a rate of 0.5. Early stopping with a patience of 10 terminates training based on the F-measure on the validation set.
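The annealing schedule and the early-stopping rule can be written down in a few lines. The sketch below assumes zero-based epoch indices and treats a validation score that merely ties the best one as no improvement; both are choices made for this example, as are the function and class names.

```python
def learning_rate(epoch: int, base_lr: float = 2e-4,
                  decay: float = 0.5, every: int = 20) -> float:
    """Halve the learning rate every `every` epochs (epoch index starts at 0)."""
    return base_lr * decay ** (epoch // every)


class EarlyStopping:
    """Stop when the validation F-measure has not improved for `patience` epochs."""

    def __init__(self, patience: int = 10):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def update(self, val_f_measure: float) -> bool:
        """Record one epoch; return True if training should stop."""
        if val_f_measure > self.best:
            self.best = val_f_measure
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy run: two improving epochs followed by ten epochs without improvement.
stopper = EarlyStopping(patience=10)
decisions = [stopper.update(score) for score in [0.5, 0.6] + [0.55] * 10]
```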

Evaluation Metrics

We adopt the F-measure as our evaluation metric. The F-measure is the harmonic mean of precision and recall and is often used to measure performance in industry and in many research fields.
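For reference, the F-measure can be computed from predicted and true labels as in the sketch below. It assumes binary labels with 1 as the positive class, and it exposes the general $F_\beta$ form with $\beta = 1$ as the default, which corresponds to the harmonic mean of precision and recall.

```python
from typing import Sequence


def f_measure(y_true: Sequence[int], y_pred: Sequence[int],
              positive: int = 1, beta: float = 1.0) -> float:
    """F-beta score for the positive class; beta = 1 gives the harmonic mean."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Toy example: 2 true positives, 1 false positive, 1 false negative.
labels = [1, 1, 1, 0, 0, 0]
preds = [1, 1, 0, 1, 0, 0]
score = f_measure(labels, preds)
```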

4.6 Results

We also test the impact of self-attention and of different pooling methods. Table 3 presents the performance on the three datasets. HConvoNet achieves better performance than $\text{HConvoNet}_{nsa}$, indicating the effect of the self-attention mechanism. Comparing mean pooling and max pooling, we find that the difference is negligible. Our proposed HConvoNet performs best in most cases.

5 CONCLUSIONS

In this paper, we propose an innovative heterogeneous conversational recommender system (HConvoNet) for financial products. We improve on traditional recommendation by integrating unstructured conversation data with structured profile data, thus considering the customer's static needs, behavior biases and dynamic interests. Future work could include exploring different methods to fuse heterogeneous data and involving multiple modalities of the conversation, such as audio.

REFERENCES

[1] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Gregory S. Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. 2016. Wide & Deep Learning for Recommender Systems. In DLRS@RecSys.

[2] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. ArXiv abs/1409.1259 (2014).

[3] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In RecSys.

[4] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. 2013. Hybrid speech recognition with Deep Bidirectional LSTM. 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (2013), 273-278.

[5] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. ArXiv abs/1703.04247 (2017).

[6] Balázs Hidasi, Massimo Quadrana, Alexandros Karatzoglou, and Domonkos Tikk. 2016. Parallel Recurrent Neural Network Architectures for Feature-rich Session-based Recommendations. In RecSys.

[7] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9 (1997), 1735-1780.

[8] Wenxiang Jiao, Haiqin Yang, Irwin King, and Michael R. Lyu. 2019. HiGRU: Hierarchical Gated Recurrent Units for Utterance-Level Emotion Recognition. In NAACL-HLT.

[9] Yu-Chin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. 2016. Field-aware Factorization Machines for CTR Prediction. In RecSys.

[10] Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014).

[11] Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. 2017. Context-Dependent Sentiment Analysis in User-Generated Videos. In ACL.

[12] Preethi, Venkata Krishna, Mohammad S. Obaidat, Vankadara Saritha, and Sumanth Yenduri. 2017. Application of Deep Learning to Sentiment Analysis for recommender system on cloud. 2017 International Conference on Computer, Information and Telecommunication Systems (CITS) (2017), 93-97.

[13] Steffen Rendle. 2010. Factorization Machines. 2010 IEEE International Conference on Data Mining (2010), 995-1000.

[14] Mike Schuster and Kuldip Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Trans. Signal Processing 45 (1997), 2673-2681.

[15] Hersh Shefrin. 2002. Beyond Greed and Fear: Understanding Behavioral Finance and the Psychology of Investing. Oxford University Press. https://books.google.com/books?id=hX18tBx3VPsC

[16] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In NIPS.

[17] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2014. Collaborative Deep Learning for Recommender Systems. In KDD.

[18] Xinxi Wang and Ye Wang. 2014. Improving Content-based and Hybrid Music Recommendation using Deep Learning. In ACM Multimedia.

[19] Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J. Smola, and How Jing. 2017. Recurrent Recommender Networks. In WSDM.

[20] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard Hovy. 2016. Hierarchical Attention Networks for Document Classification. In HLT-NAACL.

[21] Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016. Collaborative Knowledge Base Embedding for Recommender Systems. In KDD.

[22] Weinan Zhang, Tianming Du, and Jun Wang. 2016. Deep Learning over Multi-field Categorical Data: A Case Study on User Response Prediction. ArXiv abs/1601.02376 (2016).

[23] Lei Zheng, Vahid Noroozi, and Philip Yu. 2017. Joint Deep Modeling of Users and Items Using Reviews for Recommendation. In WSDM.

[24] Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification. In ACL.