<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Target-guided Knowledge-aware Recommendation Dialogue System: An Empirical Investigation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dongding Lin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jian Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenjie Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computing, The Hong Kong Polytechnic University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
<p>The target-guided recommendation dialogue system aims to make high-quality recommendations through interactive conversations proactively and naturally. Existing methods still struggle to incorporate background knowledge for coherent response generation, and to recommend appropriate items with respect to the dialogue context, user preference and recommendation target. In this paper, we investigate the problem of target-guided knowledge-aware recommendation dialogue and design a dialogue generation system to alleviate the above-mentioned issues. Specifically, we employ pre-trained language models with multi-task learning to jointly learn response generation and goal prediction towards the target. We also present a knowledge-preserving encoding strategy to maintain the facts in background knowledge. Extensive experiments on two benchmark datasets show that our system significantly outperforms various competitive models in terms of both automatic and manual evaluations. We further provide analysis and discussions to demonstrate that our system is effective in leveraging both related knowledge and planned goals to generate fluent, informative and coherent responses towards the target of recommendation.</p>
      </abstract>
      <kwd-group>
<kwd>Recommendation Dialogue</kwd>
        <kwd>Background Knowledge</kwd>
        <kwd>Target Guiding</kwd>
        <kwd>Multi-task Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Building a human-like dialogue system is one of the long-cherished goals in natural language processing (NLP) [1]. Dialogue systems can mainly be used for chatting with users for entertainment, i.e., open-domain dialogues [2, 3], or for accomplishing specific tasks, i.e., task-oriented dialogues [4, 5, 6]. In recent years, recommendation dialogue systems [7, 8] have been recognized as an important special type of task-oriented dialogue system that aims to discover user preferences and make recommendations through conversations. The growing research interest mainly comes from the benefits that dialogue provides an effective channel to handle the cold-start problem in recommendation, while recommendation-oriented tasks promote technological advances for dialogue systems [9].</p>
      <p>Many existing methods have focused on various aspects of both recommendation and conversation, including user preference modeling [10], conversation strategy [11, 12], and dialogue generation [13, 14]. Today, incorporating knowledge graphs (KG) has been recognized as crucial to bridge the gap between recommendation reasoning and response generation [13]. Background knowledge helps a dialogue system better understand user interests and make recommendations via coherent natural language communication. Since users may not have a clear preference for the unfamiliar new items recommended, especially in many sociable recommendation domains such as music, movies and news, it is also important for a dialogue system to proactively lead the conversation to the recommendation target with high user engagement and enjoyment.</p>
      <p>Recently, the emergence of the DuRecDial [15] dataset provided new insights towards the development of target-guided knowledge-aware recommendation dialogue systems. As the example in Figure 1 shows, the whole user-bot dialogue is grounded on a user profile, background knowledge, and a goal sequence. The bot needs to take both the user’s interests and the knowledge graph into consideration to decide an optimal goal path to achieve the target of recommending and playing a piece of music (i.e., “Days of Friendship”). Here, the goal path is a sequence of goals, with each goal specifying a goal type (e.g., “Greeting” or “Movie Recommendation”) and a goal topic (e.g., the movie “Orphans of the Zhao Family” or the music “Ode to Joy”). It is challenging to build such an effective recommendation dialogue system, as the planning of goals, which is required to ensure a natural and smooth conversation, must take into consideration both the user interests and the ultimate target goal. The joint modeling of dialogue context, user profile and recommendation target remains under-explored in existing work.</p>
      <p>[Figure 1: An example of a target-guided knowledge-aware recommendation dialogue from DuRecDial, grounded on a user profile, a knowledge graph, and a goal sequence (Greeting → Movie Recommendation → Chat about the Star → Music Recommendation → Play Music → Goodbye), in which the bot proactively leads the conversation to the target of playing the song “Days of Friendship”.]</p>
      <p>In this paper, we investigate two key challenges in recommendation dialogue systems: (1) how to effectively incorporate related facts in background knowledge and the user profile in dialogue generation, and (2) how to make appropriate goal planning to proactively lead the conversation. To this end, we employ multi-task learning to jointly learn response generation and goal prediction towards the final target based on pre-trained language models. Specifically, we adopt ERNIE-GEN [16], an enhanced multi-flow pre-training and fine-tuning framework for natural language generation, as our backbone model. In addition, we present a knowledge-preserving encoding strategy to maintain the background knowledge facts for dialogue generation. Extensive experiments on two benchmark datasets show that our system significantly outperforms various competitive models in terms of both automatic and manual evaluations. We have submitted our best model to the Baidu Language and Intelligence Challenge 2021 (LIC 2021, https://aistudio.baidu.com/aistudio/competition/detail/67?isFromLuge=true), where we achieved the 4th rank among 862 teams. This reveals that our methods are effective in generating informative, coherent and appropriate responses and in achieving the target of recommendation.</p>
      <p>Overall, our contributions are three-fold: (1) Towards building a target-guided recommendation dialogue system, we adopt multi-task learning to jointly model goal planning and dialogue generation based on pre-trained language models. (2) We present a knowledge-preserving encoding strategy to better maintain background knowledge facts, in order to enhance the system’s ability to generate appropriate responses by incorporating background knowledge. (3) The evaluation results show that our system achieves significant improvements compared to various competitive models.</p>
      <sec id="sec-1-1">
        <title>1https://aistudio.baidu.com/aistudio/competition/detail/67?is</title>
        <p>FromLuge=true
implicit ways. More explicitly, Lei et al. [12] leveraged set  and each goal , consists of a goal type and a
conversational recommendation as finding a path in a goal topic.  = {, }=1 represents a set of user
prouser-item-attribute graph interactively. To enhance se- files with each profile , in the format of ⟨, ⟩
mantic representations of products and related textual de- pair.  = {, }=1 is the response produced on the
scriptions of products, both Zhou et al. [18] and Rajdeep basis of the , , , and . Here,  and  denote
et al. [19] incorporated external knowledge graphs (KG) the sequence length of  and  respectively.
into CRS, which in turn led to better recommendations. Given explicit goals ′ = {1,  } (i.e., start goal
However, despite the improvement towards high-quality and target goal), a dialogue history ′ paired with the
recommendations, these methods have limited abilities related knowledge facts ′, and the user profile  ′, the
to generate natural and informative dialogues. objective of target-guided recommendation dialogue is to
decide an appropriate goal  at each turn to determine
2.2. Recommendation Dialogue System where the dialogue should go with the aim of proactively
leading the dialogue from the start goal to the target goal,
A recommendation dialogue system is a special type of and meanwhile generating a coherent and informative
task-oriented dialogue system, which is expected to en- response to achieve the goal .
courage natural human-machine interaction with a clear
target. To facilitate the research along this line, several 3.2. Model Architecture
recommendation dialogue datasets have been released,
including GoRecDial [8] and INSPIRED [20]. To further Backbone Model To tackle the issue of proactively
investigate whether the system can lead a multi-type di- planning goals for target-guided recommendation
diaalogue to approach the target of recommendation with logue, we jointly model goal prediction and dialogue
rich interaction behavior, Liu et al. [15] created a large- generation based on pre-trained language models, as
scale dialogue dataset, namely DuRecDial. Existing rec- shown in Figure 2 (a). Pre-trained language models have
ommendation dialogue approaches mainly focus on how been widely used in dialogue generation on a basis of
to efectively integrate interactive recommendation and pre-training fine-tuning framework, where they
generdialogue generation. Cai et al. [21] contributed two hi- ally concatenate diferent information sources such as
erarchical taxonomies for classifying user intents and knowledge facts and dialogue history as input, and
generrecommendation actions. To bridge the gap between rec- ate responses autoregressively. In this paper, we employ
ommendation reasoning and response generation, Ma et ERNIE-GEN [16], an enhanced multi-flow pre-training
al. [13] performed tree-structured reasoning on knowl- and fine-tuning framework for natural language
generaedge graphs, which can then be mapped to hierarchical tion, as our backbone framework. ERNIE-GEN bridges
dialogue acts to guide generation. More recently, Bai the discrepancy between training and inference with an
et al. [14] proposed a goal-oriented knowledge copy infilling generation mechanism using multi-flow
attennetwork to discern the knowledge facts that are highly tion. In light of the fact that entities and phrases are
orcorrelated to the dialogue, which assisted to generate ganized span by span, ERNIE-GEN adopts span attention
accurate knowledge-aware responses. mask matrices (see Figure 2 (b)) to determine whether</p>
        <p>In this paper, we aim to build a target-guided dialogue each token and each span can attend to each other. To
system towards recommendation. It requires the system better capture coherent semantic information of the
conto make high-quality recommendations by considering text, both word-by-word flow and span-by-span flow are
external knowledge and user preference. More impor- integrated together (see Figure 2 (c)), where the the
spantantly, the system should also be able to lead the conver- by-span generation flow aims to predict semantically
sation towards the target goal naturally by generating complete spans consecutively. In view of the fact that
appropriate responses. specific entities or spans (e.g., musics, movies, and news)
should be generated in the response as the recommended
items, we believe ERNIE-GEN is a good choice with the
3. Method advantages described above.</p>
        <sec id="sec-1-1-1">
          <title>3.1. Problem Definition</title>
          <p>Suppose a target-guided dialogue corpus is denoted as
 = {(, , , , )}=1, where  = {ℎ,}=1</p>
          <p>represents dialogue history with multiple turns,  =
{, }=1 is a set of background knowledge facts that
correspond to this conversation and each element ,
is formulated as a triplet.  = {, }=1 is a goal
sequence which is constructed upon the knowledge facts
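          <p>For concreteness, the following minimal sketch shows how one training sample under this formulation might be organized; the class and field names are illustrative assumptions rather than the actual data format of DuRecDial or of our code.</p>
          <preformat>
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical container mirroring the notation above: X (history),
# K (knowledge triplets), G (goal sequence), U (user profile), R (response).
@dataclass
class DialogueSample:
    history: List[str]                     # X = {h_t}: utterances of previous turns
    knowledge: List[Tuple[str, str, str]]  # K = {k_j}: (subject, relation, object) triplets
    goals: List[Tuple[str, str]]           # G = {g_l}: (goal type, goal topic) pairs
    profile: List[Tuple[str, str]]         # U = {u_p}: (key, value) pairs
    response: str                          # R: the gold response for the current turn

sample = DialogueSample(
    history=["Good evening, Ms. Ren Xinfeng.", "Good evening."],
    knowledge=[("Huang Xiaoming", "sing", "Days of Friendship")],
    goals=[("Greeting", "Ren Xinfeng"), ("Play Music", "Days of Friendship")],
    profile=[("gender", "Female"), ("favorite star", "Huang Xiaoming")],
    response="How is it today?",
)
          </preformat>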
        </sec>
        <sec id="sec-1-5">
          <title>3.2. Model Architecture</title>
          <p>Backbone Model. To tackle the issue of proactively planning goals for target-guided recommendation dialogue, we jointly model goal prediction and dialogue generation based on pre-trained language models, as shown in Figure 2 (a). Pre-trained language models have been widely used in dialogue generation on the basis of a pre-training and fine-tuning framework, where different information sources such as knowledge facts and dialogue history are generally concatenated as input and responses are generated autoregressively. In this paper, we employ ERNIE-GEN [16], an enhanced multi-flow pre-training and fine-tuning framework for natural language generation, as our backbone framework. ERNIE-GEN bridges the discrepancy between training and inference with an infilling generation mechanism using multi-flow attention. In light of the fact that entities and phrases are organized span by span, ERNIE-GEN adopts span attention mask matrices (see Figure 2 (b)) to determine whether each token and each span can attend to each other. To better capture coherent semantic information of the context, both the word-by-word flow and the span-by-span flow are integrated together (see Figure 2 (c)), where the span-by-span generation flow aims to predict semantically complete spans consecutively. In view of the fact that specific entities or spans (e.g., music, movies, and news) should be generated in the response as the recommended items, we believe ERNIE-GEN is a good choice with the advantages described above.</p>
          <p>[Figure 2: (a) Overview of our system; (b) span attention mask matrices; (c) overview of the multi-flow attention in ERNIE-GEN.]</p>
          <p>Knowledge-preserving Encoding. It is difficult for existing pre-trained models to encode concatenated background knowledge facts because the concatenation often exceeds the encoding length limitation of the models. In particular, according to our statistics, the concatenated background knowledge facts of each dialogue in the DuRecDial [15] dataset contain more than 1,700 tokens on average. This substantially exceeds the encoding length limitation (i.e., 512) of many pre-trained language models, including ERNIE-GEN [16]. To address this issue, we present a knowledge-preserving encoding strategy to better maintain background knowledge facts. First, all knowledge triplets K_i in the i-th dialogue are sorted according to the token length of each triplet after being concatenated. Then, we put these knowledge triplets into a number of buckets {b_m} (m = 1, ..., M) with a short-length-first-in priority, where b_m ⊆ K_i and M denotes the number of buckets. The capacity of each bucket b_m is tuned by a hyper-parameter c, which denotes that b_m contains no more than c tokens in total. We hope that after concatenating b_m with other information sources, the total input length fulfills the encoding length limitation (i.e., 512). To this end, the i-th dialogue sample is split into M dialogue samples. Note that the system will generate multiple responses during inference with this strategy. We adopt a simple unsupervised strategy to select the “best” one. We calculate mutual F1 scores by treating one response as the “ground-truth” and the others as candidate generated results. The average F1 score of the candidate results is regarded as the selection score for the “ground-truth”. Therefore, each response obtains a corresponding selection score, and we select the response with the highest selection score as the final generated response.</p>
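          <p>A simplified sketch of the bucketing and response-selection procedure described above is given below; the character-level length function and the char_f1 helper are illustrative assumptions rather than the exact implementation used in our system.</p>
          <preformat>
from typing import List, Tuple

def bucket_triplets(triplets: List[Tuple[str, str, str]], capacity: int = 360) -> List[List[Tuple[str, str, str]]]:
    """Sort triplets by concatenated token length and pack them into buckets
    of at most `capacity` tokens each (short-length-first-in)."""
    def length(t):  # character-level tokens, as in the Chinese datasets
        return len("".join(t))
    buckets, current, used = [], [], 0
    for t in sorted(triplets, key=length):
        if current and used + length(t) > capacity:
            buckets.append(current)
            current, used = [], 0
        current.append(t)
        used += length(t)
    if current:
        buckets.append(current)
    return buckets

def char_f1(a: str, b: str) -> float:
    """Illustrative character-overlap F1 between two responses."""
    if not a or not b:
        return 0.0
    common = sum(min(a.count(ch), b.count(ch)) for ch in set(a))
    if common == 0:
        return 0.0
    p, r = common / len(a), common / len(b)
    return 2 * p * r / (p + r)

def select_response(candidates: List[str]) -> str:
    """Pick the candidate with the highest average mutual F1 against the others."""
    def score(c):
        others = [o for o in candidates if o is not c]
        return sum(char_f1(c, o) for o in others) / max(len(others), 1)
    return max(candidates, key=score)
          </preformat>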
          <p>Multi-task Learning. As described in Section 3.1, the system should generate a coherent and informative response following an appropriate goal, which is decided by the system itself at each turn with the aim of proactively leading the dialogue from the start goal to the target goal. Intuitively, the goal planning process has an important effect on dialogue generation. To this end, we propose to add goal prediction at each turn as an auxiliary task, which is jointly fine-tuned with the dialogue generation task in a multi-task learning manner. Concretely, we divide the task of goal prediction into two sub-tasks: goal type prediction and goal topic prediction. We feed the hidden representation of ERNIE-GEN’s encoding output to two individual fully-connected feed-forward neural networks, followed by a softmax operator, both of which are optimized using the cross-entropy loss. As shown in Figure 2 (a), the fine-tuning objective during the training stage is to jointly optimize the goal type prediction loss ℒ_type, the goal topic prediction loss ℒ_topic, and the response generation loss ℒ_gen. We minimize the following overall loss:</p>
          <p>ℒ = λ1 · ℒ_type + λ2 · ℒ_topic + ℒ_gen   (1)</p>
          <p>where λ1 and λ2 are two hyper-parameters controlling the impact of the goal type and the goal topic. Under the supervision of goal planning in the training stage, the system learns to naturally generate coherent responses so as to achieve goal transition during the inference stage.</p>
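          <p>The following PyTorch-style sketch illustrates Eq. (1); our actual system is implemented on the PaddlePaddle ERNIE-GEN codebase, and the pooling of the encoder output and the module names here are assumptions for illustration only.</p>
          <preformat>
import torch.nn as nn

class GoalPredictionHeads(nn.Module):
    """Two feed-forward heads over the pooled encoder output, as in Eq. (1):
    one for goal type and one for goal topic, both trained with cross-entropy
    (softmax is folded into nn.CrossEntropyLoss)."""
    def __init__(self, hidden_size: int, num_goal_types: int, num_goal_topics: int):
        super().__init__()
        self.type_head = nn.Linear(hidden_size, num_goal_types)
        self.topic_head = nn.Linear(hidden_size, num_goal_topics)
        self.ce = nn.CrossEntropyLoss()

    def forward(self, encoder_hidden, type_labels, topic_labels):
        # encoder_hidden: [batch, hidden] pooled representation of the encoding output
        loss_type = self.ce(self.type_head(encoder_hidden), type_labels)
        loss_topic = self.ce(self.topic_head(encoder_hidden), topic_labels)
        return loss_type, loss_topic

def total_loss(loss_type, loss_topic, loss_gen, lambda1=1.0, lambda2=1.0):
    """Overall objective: L = lambda1 * L_type + lambda2 * L_topic + L_gen."""
    return lambda1 * loss_type + lambda2 * loss_topic + loss_gen
          </preformat>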
        </sec>
      </sec>
    <sec id="sec-2">
      <title>4. Experiments</title>
      <sec id="sec-2-1">
        <title>4.1. Datasets</title>
        <p>We conduct extensive experiments on two knowledge-aware recommendation dialogue datasets, i.e., DuConv [22] and DuRecDial [15], which are accompanied by explicitly specified goals. We also use some other dialogue datasets to enhance the fine-tuning process. All datasets are in Chinese.
• DuConv: It consists of about 30k dialogues and 270k utterances in the movie domain. Each dialogue contains about 14 background knowledge triplets on average. The goal sequence of each dialogue is an explicit path “[start] → topic_a → topic_b” over the knowledge graph, indicating how a dialogue is led from any start point relevant to topic_a to the final topic_b. Here, each topic represents one entity in the background knowledge.
• DuRecDial: It is composed of about 10k dialogues and 156k utterances over multiple domains, including chit-chat, question answering (QA), and music/movie/news recommendation, etc. Each dialogue session consists of about 15 turns on average, with about 22 background knowledge triplets and a specified user profile (e.g., age, gender, preference) in the format of ⟨key, value⟩ pairs. The goal sequence is constructed upon the knowledge and user profiles, with each goal containing a goal type and a goal topic (entity). There are altogether 21 goal types.
• Other Datasets: Since it is important to select appropriate entities or phrases from background knowledge facts for a recommendation dialogue system, we also utilize additional large-scale dialogue datasets with similar settings for incorporating knowledge in dialogue generation to help fine-tune our system. These datasets include ESTC [23], Tencent [24], and KdConv [25]. Both the ESTC and Tencent datasets are collected from open-domain conversations, with about 900k and 5.5M dialogues respectively. The KdConv dataset covers conversations about movie, music, and tourism, and has more than 3k dialogues. We will discuss the effect of these datasets on model performance in Section 5.1.</p>
      </sec>
      <sec id="sec-2-2">
        <title>4.3. Baselines</title>
        <p>We compare our system with baseline models and several
competitive methods as follows.
• Seq2Seq2 [26] is a generative baseline used in many
dialogue generation tasks. We concatenate dialogue
history, knowledge facts, and other sources (if any)
together as the input sequence and feed it to the vanilla
sequence-to-sequence (Seq2Seq) model with the
attention mechanism to generate responses.
• MGCG_R/G [15] include a retrieval-based model and
a generation-based model for multi-goal driven
conversation generation. They are presented as the second
baseline on the DuRecDial dataset.
• UniLM [27] is a unified pre-trained language model
that can be used for language generation by controlling
generation with specific self-attention masks.
• GPT-2 [28] is an autoregressive pre-trained language
model and has been successfully used in many
downstream language generation tasks. The pre-training on
large-scale text corpora makes it easy to be fine-tuned
for dialogue generation.
• GOKC3 [14] is a generation-based model with a
goaloriented knowledge discernment mechanism, which
discerns the knowledge facts that are highly correlated
to the dialogue goal and the dialogue context. Note
that GOKC is the publicly available state-of-the-art
model on both the DuConv dataset and the DuRecDial
dataset.</p>
      </sec>
      <sec id="sec-2-3">
        <title>4.2. Data Preprocessing</title>
      </sec>
      <sec id="sec-2-4">
        <title>4.4. Implementation Details</title>
        <sec id="sec-2-4-1">
          <title>To better understand the characteristics of diferent</title>
          <p>datasets, we conduct data analysis and preprocessing
ifrst. The statistics of DuConv and DuRecDial datasets
are reported in Table 1. For the DuConv dataset, it has
an average of 4.5 dialogue turns and an average of 14.2
knowledge triplets. After concatenating the multi-turn</p>
        </sec>
        <sec id="sec-2-4-2">
          <title>Our dialogue system is built on top of the oficial</title>
          <p>open-source code of ERNIE-GEN4. During training
(finetuning), both  1 and  2 are set to 1.0 and the batch size</p>
        </sec>
        <sec id="sec-2-4-3">
          <title>2https://opennmt.net/OpenNMT-py/. 3https://github.com/jq2276/Learning2Copy 4https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-gen</title>
          <p>is set to 8. We use Adam [29] optimizer with the initial • BLEU-1/2 scores: They are also calculated at the
charlearning rate of 1 × 10− 4, the 2 weight decay of 0.01 acter level, representing 1-gram and 2-gram overlaps
and the learning rate warm-up over the first 10% train- between the generated response and the gold response.
ing steps with linear decay. During generation, we adopt
beam search decoding algorithm with a beam size of 5. • Distinct (DIST)-1/2 scores: They are used to
evaluThe details are described below. ate the 1-gram diversity and 2-gram diversity of the
generated response.</p>
          <p>Fine-tuning We start fine-tuning from the pre-trained • Perplexity (PPL): It is widely used to estimate how
Chinese version of ERNIE 1.0 model [30], as it is compat- well a probability model predicts a sample. A low
ible with the ERNIE-GEN framework and its pre-trained perplexity indicates the model is good at predicting
model checkpoint can be directly loaded. We first fine- the sample.
tune our system on 3 large-scale dialogue datasets (as
described in Section 4.1) for 5 epochs. Due to the large Human Evaluation With human evaluation, we
ransize of the ESTC dataset and the Tencent dataset, we domly select 100 dialogue samples from the testset, and
randomly extract 400K dialogue samples from each origi- then invite 5 evaluators to independently assign the
ratnal dataset. We continue to fine-tune our system on the ing score for the output of each model following the
mettarget dialogue datasets (DuConv and DuRecDial) for 10 rics suggested in [15]. The score of each metric is ranged
epochs, with the bucket capacity  setting to 360. from 0 to 2. Furthermore, we also report the human
evaluation in the Baidu LIC 2021, where crowed-sourcing
Vocabulary Expansion We find that ERNIE-GEN annotators are invited to conduct about 10 multi-round
may generate unknown words (i.e., [UNK]), i.e., the conversations with each submitted system and to judge
words out of the vocabulary. Therefore, we add the ad- the dialogue quality. The metrics used in our evaluation
ditional tokens with high occurrence extracted from the and in Baidu LIC 2021 are in consistent, including:
datasets to expand the original vocabulary. The final
vocabulary size is 18,000, which can cover almost all the • Informativeness (Info.): It measures if the model
Chinese characters and common special tokens in the makes full use of knowledge facts in the generated
datasets. response.</p>
          <p>Deduplication We observe that our system tends to
generate repeated words or phrases sometimes, which is
a common issue that is still under exploration in natural • Knowledge accuracy (Know Acc.): It evaluates the
language generation. To make the generated response accuracy of the selected knowledge in the generated
look more fluent, we remove the consecutive repeated response.
words using regular expression rules.
• Coherence (Cohe.): It measures the overall fluency
of the whole dialogue generation.</p>
        </sec>
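        <p>As a rough illustration of the deduplication step, the snippet below collapses consecutively repeated tokens and short character spans; the exact regular expression rules used in our system may differ.</p>
        <preformat>
import re

def remove_consecutive_repeats(text: str) -> str:
    """Collapse immediately repeated words or short character spans.
    The patterns here are illustrative, not the exact rules used in our system."""
    # Collapse a repeated whitespace-separated token: "word word word" -> "word"
    text = re.sub(r'\b(\w+)(?:\s+\1\b)+', r'\1', text)
    # Collapse a repeated short character span (common in Chinese generation)
    text = re.sub(r'(.{1,4}?)\1{1,}', r'\1', text)
    return text

print(remove_consecutive_repeats("这首歌 非常非常 好听 好听"))  # -> "这首歌 非常 好听"
        </preformat>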
      </sec>
      <sec id="sec-2-5">
        <title>4.5. Evaluation Metrics</title>
        <p>Automatic Evaluation. Following the common practice [15, 14], we adopt the following automatic evaluation metrics.
• F1 score: It indicates whether the model can generate appropriate entities in the response.
• Recommendation success rate (Rec. Succ.): It estimates how well the target recommendation goal is achieved.
• BLEU-1/2 scores: They are also calculated at the character level, representing 1-gram and 2-gram overlaps between the generated response and the gold response.
• Distinct (DIST)-1/2 scores: They are used to evaluate the 1-gram diversity and 2-gram diversity of the generated responses.
• Perplexity (PPL): It is widely used to estimate how well a probability model predicts a sample. A low perplexity indicates the model is good at predicting the sample.</p>
        <p>Human Evaluation. For human evaluation, we randomly select 100 dialogue samples from the test set, and then invite 5 evaluators to independently assign a rating score to the output of each model following the metrics suggested in [15]. The score of each metric ranges from 0 to 2. Furthermore, we also report the human evaluation in the Baidu LIC 2021, where crowd-sourcing annotators are invited to conduct about 10 multi-round conversations with each submitted system and to judge the dialogue quality. The metrics used in our evaluation and in Baidu LIC 2021 are consistent, including:
• Informativeness (Info.): It measures whether the model makes full use of knowledge facts in the generated response.
• Knowledge accuracy (Know Acc.): It evaluates the accuracy of the selected knowledge in the generated response.
• Coherence (Cohe.): It measures the overall fluency of the whole dialogue generation.</p>
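        <p>For reference, one common way to compute the character-level DIST-n scores is sketched below; the exact normalization used in our evaluation scripts may differ slightly.</p>
        <preformat>
def distinct_n(responses, n=1):
    """DIST-n: ratio of unique character n-grams to total character n-grams
    over all generated responses (a common definition, shown for reference)."""
    total, unique = 0, set()
    for resp in responses:
        chars = list(resp.replace(" ", ""))
        ngrams = [tuple(chars[i:i + n]) for i in range(len(chars) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

print(distinct_n(["我喜欢这首歌", "我也喜欢"], n=2))
        </preformat>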
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Results and Analysis</title>
      <sec id="sec-3-1">
        <title>5.1. Automatic Evaluation</title>
        <p>The automatic evaluation results on the DuConv dataset and the DuRecDial dataset are reported in Table 2. Our model outperforms all the compared models, and achieves a significant improvement on most of the evaluation metrics. Specifically, on the DuConv dataset, the normalized models (i.e., norm retrieval and norm generation) refer to using normalized data by replacing the two specific goals in the knowledge path with “topic_a” and “topic_b” respectively, following [22]. As shown in Table 2, our model yields substantial improvements over existing pre-trained models, including UniLM and GPT-2, on both F1 and BLEU-1/2. This demonstrates that our model can generate more coherent and informative responses at the n-gram level. Compared to the state-of-the-art model GOKC, our model without using external dialogue datasets (w/o ext) still achieves about 0.13%, 8%, and 32% improvements in terms of F1, BLEU-1, and BLEU-2, respectively. After using external dialogue datasets (w/ ext), our model further achieves 1.39%, 9.7%, and 35.6% improvements in F1, BLEU-1, and BLEU-2 compared to GOKC, which indicates that fine-tuning on large-scale task-relevant dialogue datasets is effective to improve the performance on the final target-guided knowledge-aware recommendation dialogue task. Note that the normalized retrieval method achieves the highest DIST-1/2 scores. Retrieval-based methods that directly select responses from a list of candidates are more likely to retain the diversity of natural responses.</p>
        <p>As shown in Table 2, our model also achieves superior performance over all baseline methods on the DuRecDial dataset. In particular, compared to the competitive model GOKC, our model without using external dialogue datasets obtains about 2.8%, 15%, and 26.1% improvements on the F1, BLEU-1, and BLEU-2 evaluation metrics. When using external dialogue datasets, our model achieves about 3%, 15.3%, and 26.7% improvements accordingly. Our model with multi-task learning further outperforms the baseline methods on all metrics, which demonstrates that the joint modeling of goal planning and dialogue generation is effective to help the system select appropriate knowledge from the background facts to facilitate generation. Besides, we also observe that the perplexity of our model is much lower, indicating that our model is more likely to generate fluent responses. It should be noted that, based on its default setting, GOKC actually assumes that the full goal sequence is provided and thus does not require any goal planning [15]. Therefore, for a fair comparison we also report our evaluation results using the available full goal sequence in Table 2. The results further show the effectiveness of multi-task learning for our system. Overall, our model achieves significant improvements over competitive methods in terms of all automatic evaluation metrics.</p>
      </sec>
      <sec id="sec-3-3">
        <title>5.2. Human Evaluation</title>
        <p>The human evaluation results of the baseline models and our model are presented in Table 3. As shown in Table 3, our model obtains the highest human scores on both the DuConv and DuRecDial datasets, which shows its effectiveness in generating informative and coherent responses with correct knowledge and consistent information. Specifically, our model achieves a significant improvement in terms of knowledge accuracy, which further verifies that fine-tuning on large-scale task-relevant dialogue datasets is effective to improve the ability of our model to incorporate knowledge into generation. We also observe that our model with multi-task learning obtains a much better recommendation success rate on the DuRecDial dataset. This verifies that our joint modeling of goal prediction and response generation enables the system to make more accurate recommendations with respect to the given goals and the user profile.</p>
        <p>We submitted our best model to the Baidu LIC 2021 and achieved the 4th rank among 862 teams. The human evaluation results on the leaderboard are shown in Table 4. Note that the human evaluation here is more challenging due to two aspects: (1) the decision of the current goal relies on the previously predicted goals, and (2) the generation of the response at the current turn is further decided by the current goal. This is likely to cause error accumulation for a model during multi-turn conversations. Therefore, these evaluation results can better reveal the abilities of different models to guide the conversation to the target. As shown in Table 4, our system is competitive compared to the others. However, our model performs inferior on the DuRecDial dataset in terms of the recommendation success rate, which encourages us to further improve goal planning strategies in future work.</p>
      </sec>
      <sec id="sec-3-2">
        <title>5.3. Discussion</title>
        <p>Analysis of Implementation Details. We study the contribution of each part in our system by conducting experiments with several variants of our system. The results are shown in Table 5. Here are our findings. (1) All strategies are effective in improving dialogue generation performance. (2) The knowledge-preserving encoding strategy contributes significantly to dialogue generation, especially on the DuRecDial dataset where the input source sequence is much longer. Compared to previous methods that truncate tokens when the sequence exceeds the encoding length limitation of pre-trained models, our proposed encoding strategy better maintains the knowledge facts. (3) After expanding the vocabulary, our model achieves significant improvements over most of the metrics.</p>
        <p>Future Research Direction. In the real world, recommending new target items that may attract users is meaningful, since users often have no definite preference for many unknown items. We are trying to achieve this objective through the development of the target-guided knowledge-aware recommendation dialogue system. We understand that it is not sufficient to simply model the target and the dialogue with multi-task learning as investigated in this paper. We will leave the problem of proactively planning goals step by step towards the target goal as our future research direction.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusion</title>
      <p>In this paper, we explore target-guided knowledge-aware recommendation dialogue based on the pre-training and fine-tuning framework, which aims to proactively lead the conversation and learn to make high-quality recommendations. We present a knowledge-preserving encoding strategy and a multi-task learning approach to enable our system to effectively recommend appropriate items and to generate fluent and coherent responses. The experimental results on two benchmark datasets demonstrate the effectiveness and superiority of our system compared to other competitive models in terms of both automatic and manual evaluations. We also discuss the implementation details and our future research directions.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The work described in this paper was supported by
Research Grants Council of Hong Kong (PolyU/15207920,
PolyU/15207821), National Natural Science Foundation
of China (61672445, 62076212) and PolyU Internal Grants
(ZVVX, ZG7H, ZVQ0).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>mas</surname>
          </string-name>
          , W. Wang (Eds.),
          <source>The Thirteenth ACM Interna</source>
          [1]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Turing</surname>
          </string-name>
          ,
          <article-title>Computing machinery</article-title>
          and intelli- tional
          <source>Conference on Web Search and Data Mining</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          gence, in: Parsing the turing test,
          <year>2009</year>
          , pp.
          <fpage>23</fpage>
          -
          <lpage>65</lpage>
          . (WSDM),
          <year>2020</year>
          , pp.
          <fpage>304</fpage>
          -
          <lpage>312</lpage>
          . [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , Se- [12]
          <string-name>
            <given-names>W.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Miao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          , L. Chen,
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>ceedings of the 57th Annual Meeting of the As-</article-title>
          Y. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Prakash</surname>
          </string-name>
          (Eds.),
          <source>The 26th ACM</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>sociation for Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <source>SIGKDD Conference on Knowledge Discovery and</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          3696-
          <fpage>3709</fpage>
          .
          <source>Data Mining (KDD)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>2073</fpage>
          -
          <lpage>2083</lpage>
          . [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Takanobu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          , Bridg-
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Artificial</given-names>
            <surname>Intelligence</surname>
          </string-name>
          ,
          <year>2019</year>
          , pp.
          <fpage>7281</fpage>
          -
          <lpage>7288</lpage>
          . arXiv:
          <year>2010</year>
          .
          <volume>10333</volume>
          (
          <year>2020</year>
          ). [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Eric</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Charette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          , Learning
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>alogue, in: Proceedings of the 18th Annual SIG tion</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>dial Meeting on Discourse and Dialogue</source>
          ,
          <year>2017</year>
          , pp.
          <source>Artificial Intelligence</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>12535</fpage>
          -
          <lpage>12543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          37-
          <fpage>49</fpage>
          . [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-Y.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Che</surname>
          </string-name>
          , T. Liu, To[5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-S. Wu</surname>
          </string-name>
          , P. Fung,
          <article-title>Mem2Seq: Efec- wards conversational recommendation over multi-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>tively incorporating knowledge bases into end-to- type dialogs</article-title>
          ,
          <source>in: Proceedings of the 58th Annual</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>of the 56th Annual Meeting of the Association for guistics (ACL</article-title>
          ),
          <year>2020</year>
          , pp.
          <fpage>1036</fpage>
          -
          <lpage>1049</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Computational</surname>
          </string-name>
          <article-title>Linguistics (ACL) (Volume 1: Long</article-title>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tian</surname>
          </string-name>
          , H. Wu,
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>Papers)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1468</fpage>
          -
          <lpage>1478</lpage>
          . H.
          <string-name>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>ERNIE-GEN: an enhanced multi-flow [</article-title>
          6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <article-title>Global-to-local memory pre-training and fine-tuning framework for natural</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <article-title>pointer networks for task-oriented dialogue, in: 7th language generation</article-title>
          , in: C.
          <string-name>
            <surname>Bessiere</surname>
          </string-name>
          (Ed.), Pro-
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <source>tions (ICLR)</source>
          ,
          <year>2019</year>
          .
          <source>Conference on Artificial Intelligence (IJCAI)</source>
          ,
          <year>2020</year>
          , [7]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cen</surname>
          </string-name>
          , H. Yang, pp.
          <fpage>3997</fpage>
          -
          <lpage>4003</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Towards knowledge-based recommender</article-title>
          [17]
          <string-name>
            <given-names>K.</given-names>
            <surname>Christakopoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Radlinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          , To-
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>dialog system</article-title>
          ,
          <source>in: Proceedings of the 2019</source>
          Confer
          <article-title>- wards conversational recommender systems</article-title>
          , in:
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>ence on Empirical Methods in Natural Language Proceedings of the 22nd ACM SIGKDD</source>
          Interna-
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>Processing and the 9th International Joint Con- tional Conference on Knowledge Discovery and</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>ference on Natural Language Processing (EMNLP- Data Mining</surname>
          </string-name>
          ,
          <year>2016</year>
          , pp.
          <fpage>815</fpage>
          -
          <lpage>824</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <source>IJCNLP)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1803</fpage>
          -
          <lpage>1813</lpage>
          . [18]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          , [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Balakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Crook</surname>
          </string-name>
          , Y.-L.
          <article-title>Improving conversational recommender systems</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <article-title>oriented dialogue</article-title>
          ,
          <source>in: Proceedings of the 2019 Con- 26th ACM SIGKDD Conference on Knowledge Dis-</source>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <article-title>ference on Empirical Methods in Natural Language covery and Data Mining (KDD</article-title>
          ),
          <year>2020</year>
          , pp.
          <fpage>1006</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <source>Processing and the 9th International Joint Con- 1014.</source>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <source>ference on Natural Language Processing</source>
          (EMNLP- [19]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Goswami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          , Sug-
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <source>IJCNLP)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1951</fpage>
          -
          <lpage>1961</lpage>
          .
          <article-title>gest me a movie for tonight: Leveraging knowl</article-title>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Manzoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>A survey edge graphs for conversational recommendation,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <article-title>on conversational recommender systems</article-title>
          ,
          <source>ACM in: Proceedings of the 28th International Confer-</source>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>Computing</given-names>
            <surname>Surveys</surname>
          </string-name>
          (CSUR)
          <volume>54</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>36</lpage>
          . ence on Computational
          <source>Linguistics (COLING)</source>
          ,
          <year>2020</year>
          , [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Moon</surname>
          </string-name>
          , H. Liu,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          , User pp.
          <fpage>4179</fpage>
          -
          <lpage>4189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <article-title>memory reasoning for conversational recommen</article-title>
          - [20]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hayati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          , IN-
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          dation,
          <source>in: Proceedings of the 28th International SPIRED: Toward sociable recommendation dialog</source>
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Conference on Computational</surname>
          </string-name>
          <article-title>Linguistics (COL- systems</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on</source>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <source>ING)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>5288</fpage>
          -
          <lpage>5308</lpage>
          . Empirical Methods in
          <source>Natural Language Processing</source>
          [11]
          <string-name>
            <given-names>W.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Miao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hong</surname>
          </string-name>
          , M. Kan,
          <source>(EMNLP)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>8142</fpage>
          -
          <lpage>8152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <given-names>T.</given-names>
            <surname>Chua</surname>
          </string-name>
          ,
          <string-name>
            <surname>Estimation-</surname>
            action-reflection: Towards [21]
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Cai</surname>
          </string-name>
          , L. Chen,
          <article-title>Predicting user intents and sat-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          ommendations,
          <source>in: Proceedings of the 28th ACM</source>
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <surname>sonalization</surname>
          </string-name>
          ,
          <year>2020</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>42</lpage>
          . [22]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , R. Lian,
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          arXiv:
          <year>1906</year>
          .
          <volume>05572</volume>
          (
          <year>2019</year>
          ). [23]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          , Emo-
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <string-name>
            <surname>Intelligence</surname>
          </string-name>
          ,
          <year>2018</year>
          , pp.
          <fpage>730</fpage>
          -
          <lpage>739</lpage>
          . [24]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          , S. Shi,
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <source>ings of the 2019 Conference on Empirical Methods</source>
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <source>in Natural Language Processing and the 9th Inter-</source>
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <string-name>
            <surname>cessing (EMNLP-IJCNLP)</surname>
          </string-name>
          ,
          <year>2019</year>
          , pp.
          <fpage>1866</fpage>
          -
          <lpage>1875</lpage>
          . [25]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , Kd-
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <source>in: Proceedings of the 58th Annual Meeting of the</source>
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <year>2020</year>
          , pp.
          <fpage>7098</fpage>
          -
          <lpage>7108</lpage>
          . [26]
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          , Sequence
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <source>Neural Information Processing Systems</source>
          <volume>27</volume>
          : Annual
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <source>tems</source>
          <year>2014</year>
          ,
          <year>2014</year>
          , pp.
          <fpage>3104</fpage>
          -
          <lpage>3112</lpage>
          . [27]
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <source>cessing Systems 32: Annual Conference on Neural</source>
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          <source>Information Processing Systems (NeurIPS)</source>
          ,
          <year>2019</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          pp.
          <fpage>13042</fpage>
          -
          <lpage>13054</lpage>
          . [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          <article-title>vised multitask learners</article-title>
          ,
          <source>OpenAI blog 1</source>
          (
          <year>2019</year>
          )
          <article-title>9</article-title>
          . [29]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A method for stochastic</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          optimization, in: Y. Bengio, Y. LeCun (Eds.),
          <fpage>3rd</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          <source>tions (ICLR)</source>
          ,
          <year>2015</year>
          . [30]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          , H. Zhang,
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          <string-name>
            <surname>tion</surname>
          </string-name>
          , arXiv preprint arXiv:
          <year>1904</year>
          .
          <volume>09223</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>