=Paper=
{{Paper
|id=Vol-2960/paper7
|storemode=property
|title=Target-guided Knowledge-aware Recommendation Dialogue System: An Empirical Investigation (Long paper)
|pdfUrl=https://ceur-ws.org/Vol-2960/paper7.pdf
|volume=Vol-2960
|authors=Dongding Lin,Jian Wang,Wenjie Li
|dblpUrl=https://dblp.org/rec/conf/recsys/LinWL21
}}
==Target-guided Knowledge-aware Recommendation Dialogue System: An Empirical Investigation (Long paper)==
Target-guided Knowledge-aware Recommendation Dialogue System: An Empirical Investigation

Dongding Lin, Jian Wang and Wenjie Li
Department of Computing, The Hong Kong Polytechnic University
csdlin@comp.polyu.edu.hk (D. Lin); csjiwang@comp.polyu.edu.hk (J. Wang); cswjli@comp.polyu.edu.hk (W. Li)

3rd Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) & 5th Edition of Recommendation in Complex Environments (ComplexRec) Joint Workshop @ RecSys 2021, September 27 – October 1, 2021, Amsterdam, Netherlands. © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The target-guided recommendation dialogue system aims to make high-quality recommendations through interactive conversations proactively and naturally. Existing methods still struggle to incorporate background knowledge for coherent response generation, and to recommend appropriate items with respect to the dialogue context, user preference and recommendation target. In this paper, we investigate the problem of target-guided knowledge-aware recommendation dialogue and design a dialogue generation system to alleviate the above-mentioned issues. Specifically, we employ pre-trained language models with multi-task learning to jointly learn response generation and goal prediction towards the target. We also present a knowledge-preserving encoding strategy to maintain the facts in background knowledge. Extensive experiments on two benchmark datasets show that our system significantly outperforms various competitive models in terms of both automatic and manual evaluations. We further provide analysis and discussions to demonstrate that our system is effective in leveraging both related knowledge and planned goals to generate fluent, informative and coherent responses towards the target of recommendation.

Keywords: Recommendation Dialogue, Background Knowledge, Target Guiding, Multi-task Learning

1. Introduction

Building a human-like dialogue system is one of the long-cherished goals in natural language processing (NLP) [1]. Dialogue systems can be mainly used for chatting with users for entertainment, i.e., open-domain dialogues [2, 3], or for accomplishing specific tasks, i.e., task-oriented dialogues [4, 5, 6]. In recent years, recommendation dialogue systems [7, 8] have been recognized as an important special type of task-oriented dialogue system, with the aim of discovering user preferences and making recommendations through conversations. The growing research interest mainly comes from the benefit that dialogue provides an effective channel to handle the cold-start problem in recommendation, while recommendation-oriented tasks promote technological advances for dialogue systems [9].

Many existing methods have focused on various aspects of both recommendation and conversation, including user preference modeling [10], conversation strategy [11, 12], and dialogue generation [13, 14]. Today, incorporating knowledge graphs (KG) has been recognized as crucial to bridge the gap between recommendation reasoning and response generation [13]. Background knowledge helps a dialogue system better understand user interests and make recommendations via coherent natural language communication. Since users may not have a clear preference for the unfamiliar new items recommended, especially in many sociable recommendation domains such as music, movies and news, it is also important for a dialogue system to proactively lead the conversation to the recommendation target with high user engagement and enjoyment.

Recently, the emergence of the DuRecDial [15] dataset provided new insights towards the development of target-guided knowledge-aware recommendation dialogue systems. As the example in Figure 1 shows, the whole user-bot dialogue is grounded on a user profile, background knowledge, and a goal sequence. The bot needs to take both the user's interests and the knowledge graph into consideration to decide an optimal goal path that achieves the target of recommending and playing a piece of music (i.e., "Days of Friendship"). Here, the goal path is a sequence of goals, with each goal specifying a goal type (e.g., "Greeting" or "Movie Recommendation") and a goal topic (e.g., the movie "Orphans of the Zhao Family" or the music "Ode to Joy"). It is challenging to build such an effective recommendation dialogue system, as the planning of goals, which is required to ensure a natural and smooth conversation, must take into consideration both the user interests and the ultimate target goal. The joint modeling of dialogue context, user profile and recommendation target remains under-explored in existing work.

[Figure 1 shows a 22-turn user-bot conversation for the user "Ren Xinfeng", together with her user profile, the planned goal sequence (Greeting → Movie Recommendation → Chat about the Star → Music Recommendation → Play Music → Goodbye), and a knowledge graph centered on the actor Huang Xiaoming that grounds the recommended movie "Orphans of the Zhao Family" and the songs "Ode to Joy" and "Days of Friendship".]

Figure 1: An illustrative example of target-guided knowledge-aware recommendation dialogue from the DuRecDial dataset. The whole dialogue is grounded on a user profile, background knowledge, and a goal sequence, where the goal sequence is planned by the bot to achieve the target of recommending and playing a music.

In this paper, we investigate two key challenges in recommendation dialogue systems: (1) how to effectively incorporate related facts in background knowledge and the user profile into dialogue generation, and (2) how to make appropriate goal planning to proactively lead the conversation. To this end, we employ multi-task learning to jointly learn response generation and goal prediction towards the final target based on pre-trained language models. Specifically, we adopt ERNIE-GEN [16], an enhanced multi-flow pre-training and fine-tuning framework for natural language generation, as our backbone model. In addition, we also present a knowledge-preserving encoding strategy to maintain the background knowledge facts for dialogue generation. Extensive experiments on two benchmark datasets show that our system significantly outperforms various competitive models in terms of both automatic and manual evaluations. We submitted our best model to the Baidu Language and Intelligence Challenge 2021 (LIC 2021, https://aistudio.baidu.com/aistudio/competition/detail/67?isFromLuge=true), where we achieved the 4th rank among 862 teams. This reveals that our methods are effective in generating informative, coherent and appropriate responses and in achieving the target of recommendation.

Overall, our contributions are threefold:

(1) Towards building a target-guided recommendation dialogue system, we adopt multi-task learning to jointly model goal planning and dialogue generation based on pre-trained language models.

(2) We present a knowledge-preserving encoding strategy to better maintain background knowledge facts, in order to enhance the system's ability to generate appropriate responses by incorporating background knowledge.

(3) Evaluation results show that our system achieves significant improvements compared to various competitive models.
2. Related Work

The two research lines that motivate our study are conversational recommendation systems (CRS) and recommendation dialogue systems. We briefly introduce some representative works below.

2.1. Conversational Recommendation System

A conversational recommendation system (CRS) is a recommendation system (RS) that provides personalized recommendations through natural language conversations. Christakopoulou et al. [17] argued that asking questions benefits an RS, which can then better understand user preferences based on user feedback. To this end, they suggested moving from the traditional RS to the CRS. Lei et al. [11] proposed a three-stage framework called Estimation-Action-Reflection (EAR) to fill the interaction gap between conversation and recommendation in implicit ways. More explicitly, Lei et al. [12] treated conversational recommendation as interactively finding a path in a user-item-attribute graph. To enhance the semantic representations of products and the related textual descriptions of products, both Zhou et al. [18] and Sarkar et al. [19] incorporated external knowledge graphs (KG) into CRS, which in turn led to better recommendations. However, despite the improvement towards high-quality recommendations, these methods have limited abilities to generate natural and informative dialogues.

2.2. Recommendation Dialogue System

A recommendation dialogue system is a special type of task-oriented dialogue system, which is expected to encourage natural human-machine interaction with a clear target. To facilitate research along this line, several recommendation dialogue datasets have been released, including GoRecDial [8] and INSPIRED [20]. To further investigate whether a system can lead a multi-type dialogue to approach the target of recommendation with rich interaction behavior, Liu et al. [15] created a large-scale dialogue dataset, namely DuRecDial. Existing recommendation dialogue approaches mainly focus on how to effectively integrate interactive recommendation and dialogue generation. Cai et al. [21] contributed two hierarchical taxonomies for classifying user intents and recommendation actions. To bridge the gap between recommendation reasoning and response generation, Ma et al. [13] performed tree-structured reasoning on knowledge graphs, which can then be mapped to hierarchical dialogue acts to guide generation. More recently, Bai et al. [14] proposed a goal-oriented knowledge copy network to discern the knowledge facts that are highly correlated with the dialogue, which helps to generate accurate knowledge-aware responses.

In this paper, we aim to build a target-guided dialogue system for recommendation. It requires the system to make high-quality recommendations by considering external knowledge and user preference. More importantly, the system should also be able to lead the conversation towards the target goal naturally by generating appropriate responses.
3. Method

3.1. Problem Definition

Suppose a target-guided dialogue corpus is denoted as $D = \{(H_i, K_i, G_i, P_i, Y_i)\}_{i=1}^{N}$, where $H_i = \{h_{i,t}\}_{t=1}^{T}$ represents a dialogue history with multiple turns, and $K_i = \{k_{i,j}\}_{j=1}^{N_K}$ is the set of background knowledge facts that correspond to this conversation, with each element $k_{i,j}$ formulated as a triplet. $G_i = \{g_{i,j}\}_{j=1}^{L_G}$ is a goal sequence constructed upon the knowledge facts set $K_i$, where each goal $g_{i,j}$ consists of a goal type and a goal topic. $P_i = \{p_{i,j}\}_{j=1}^{N_P}$ represents a set of user profiles, with each profile $p_{i,j}$ in the format of a $\langle key, value \rangle$ pair. $Y_i = \{y_{i,j}\}_{j=1}^{L_Y}$ is the response produced on the basis of $H_i$, $K_i$, $G_i$, and $P_i$. Here, $L_G$ and $L_Y$ denote the sequence lengths of $G_i$ and $Y_i$, respectively.

Given the explicit goals $G' = \{g_1, g_{L_G}\}$ (i.e., the start goal and the target goal), a dialogue history $H'$ paired with the related knowledge facts $K'$, and the user profile $P'$, the objective of target-guided recommendation dialogue is to decide an appropriate goal $g_c$ at each turn, determining where the dialogue should go so as to proactively lead the dialogue from the start goal to the target goal, and meanwhile to generate a coherent and informative response that achieves the goal $g_c$.
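To make the notation concrete, here is a minimal sketch of what one corpus sample $(H_i, K_i, G_i, P_i, Y_i)$ could look like in code. This is our own illustration with hypothetical field names, not the serialization format actually used by DuConv or DuRecDial:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# One knowledge fact k_{i,j} is a (subject, relation, object) triplet,
# e.g. ("Huang Xiaoming", "sing", "Ode to Joy").
Triplet = Tuple[str, str, str]

@dataclass
class Goal:
    goal_type: str   # e.g. "Greeting", "Music Recommendation"
    goal_topic: str  # e.g. "Days of Friendship"

@dataclass
class DialogueSample:
    history: List[str]        # H_i: utterances h_{i,1..T}
    knowledge: List[Triplet]  # K_i: background knowledge facts
    goals: List[Goal]         # G_i: goal sequence g_{i,1..L_G}
    profile: Dict[str, str]   # P_i: <key, value> user profile pairs
    response: str             # Y_i: the gold response
```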
3.2. Model Architecture

Backbone Model. To tackle the issue of proactively planning goals for target-guided recommendation dialogue, we jointly model goal prediction and dialogue generation based on pre-trained language models, as shown in Figure 2 (a). Pre-trained language models have been widely used in dialogue generation on the basis of the pre-training fine-tuning framework, where different information sources such as knowledge facts and dialogue history are generally concatenated as the input, and responses are generated autoregressively. In this paper, we employ ERNIE-GEN [16], an enhanced multi-flow pre-training and fine-tuning framework for natural language generation, as our backbone framework. ERNIE-GEN bridges the discrepancy between training and inference with an infilling generation mechanism using multi-flow attention. In light of the fact that entities and phrases are organized span by span, ERNIE-GEN adopts span attention mask matrices (see Figure 2 (b)) to determine whether each token and each span can attend to each other. To better capture coherent semantic information of the context, a word-by-word flow and a span-by-span flow are integrated (see Figure 2 (c)), where the span-by-span generation flow aims to predict semantically complete spans consecutively. In view of the fact that specific entities or spans (e.g., musics, movies, and news) should be generated in the response as the recommended items, we believe ERNIE-GEN is a good choice with the advantages described above.

Figure 2: Illustration of the methodology. (a): Overview of our system, with goal type prediction, goal topic prediction, and response generation jointly modeled in a multi-task learning manner. (b): The span attention mask matrices used in our system. (c): Overview of the Multi-Flow Attention in ERNIE-GEN [16], which is employed as our backbone model.
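To give a feel for the span-level masking in Figure 2 (b), below is a deliberately simplified toy sketch (our own, not the exact ERNIE-GEN formulation): a block-causal mask in which every position sees the source context, and each target span additionally sees all earlier spans and itself, but never a future span:

```python
import numpy as np

def span_attention_mask(num_ctx: int, span_lens: list) -> np.ndarray:
    """Toy span-level causal mask (True = "can attend"): context
    tokens are mutually visible; each target span attends to the
    context, to all previous spans, and to itself, but never to a
    future span."""
    n = num_ctx + sum(span_lens)
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :num_ctx] = True  # every position sees the full context
    start = num_ctx
    for length in span_lens:
        end = start + length
        mask[start:end, num_ctx:end] = True  # up to and incl. itself
        start = end
    return mask

# 4 context tokens followed by target spans of lengths 2 and 3
print(span_attention_mask(4, [2, 3]).astype(int))
```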
Knowledge-preserving Encoding. It is difficult for existing pre-trained models to encode the concatenated background knowledge facts, because the concatenation often exceeds the encoding length limitation of the models. In particular, according to our statistics, the concatenated background knowledge facts of each dialogue in the DuRecDial [15] dataset contain more than 1,700 tokens on average, which substantially exceeds the encoding length limitation (i.e., 512) of many pre-trained language models, including ERNIE-GEN [16]. To address this issue, we present a knowledge-preserving encoding strategy to better maintain the background knowledge facts. First, all knowledge triplets $K_i$ of the $i$-th dialogue are sorted according to the token length of each triplet after concatenation. Then, we put these knowledge triplets into a number of buckets $\{B_{i,j}\}_{j=1}^{N_{B_i}}$ with short-length-first-in priority, where $B_{i,j} \subseteq K_i$ and $N_{B_i}$ denotes the number of buckets. The capacity of each bucket $B_{i,j}$ is tuned by a hyper-parameter $C$, meaning that $B_{i,j}$ contains no more than $C$ tokens in total. We expect that after concatenating $B_{i,j}$ with the other information sources, the total input length fulfills the encoding length limitation (i.e., 512). To this end, the $i$-th dialogue sample is split into $N_{B_i}$ dialogue samples. Note that with this strategy the system will generate multiple responses during inference, so we adopt a simple unsupervised strategy to select the "best" one. We calculate mutual F1 scores by treating one response as the "ground truth" and the others as candidate generated results. The average F1 score over the candidate results is regarded as the selection score of that "ground truth", so each response obtains a corresponding selection score. We select the response with the highest selection score as the final generated response.
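The following is a minimal sketch of how the bucketing and the mutual-F1 selection could be implemented as we read the description above. The greedy fill and the character-level length count are our assumptions, and `f1` stands for any pairwise F1 function:

```python
def bucket_knowledge(triplets, capacity):
    """Greedy knowledge-preserving bucketing: sort triplets by the
    token length of their concatenation (shortest first), then open
    a new bucket whenever adding the next triplet would push the
    bucket past `capacity` tokens. Each bucket yields one sample
    for the same dialogue."""
    def length(t):
        return len("".join(t))  # character count, as for Chinese text

    buckets, current, used = [], [], 0
    for t in sorted(triplets, key=length):
        if current and used + length(t) > capacity:
            buckets.append(current)
            current, used = [], 0
        current.append(t)
        used += length(t)
    if current:
        buckets.append(current)
    return buckets

def select_response(candidates, f1):
    """Unsupervised 'mutual F1' selection: score each candidate by
    its average F1 against all other candidates, keep the best."""
    def score(idx):
        others = [c for j, c in enumerate(candidates) if j != idx]
        return sum(f1(candidates[idx], c) for c in others) / max(len(others), 1)
    return candidates[max(range(len(candidates)), key=score)]
```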
Multi-task Learning. As described in Section 3.1, the system should generate a coherent and informative response following an appropriate goal, which is decided by the system itself at each turn with the aim of proactively leading the dialogue from the start goal to the target goal. Intuitively, the goal planning process has an important effect on dialogue generation. To this end, we propose to add goal prediction at each turn as an auxiliary task, which is jointly fine-tuned with the dialogue generation task in a multi-task learning manner. Concretely, we divide the task of goal prediction into two sub-tasks: goal type prediction and goal topic prediction. We feed the hidden representation of ERNIE-GEN's encoding output to two individual fully-connected feed-forward neural networks, each followed by a softmax operator, and both are optimized using the cross-entropy loss. As shown in Figure 2 (a), the fine-tuning objective during the training stage is to jointly optimize the goal type prediction loss $\mathcal{L}_{type}$, the goal topic prediction loss $\mathcal{L}_{topic}$, and the response generation loss $\mathcal{L}_{gen}$. We minimize the following overall loss:

$\mathcal{L} = \beta_1 \mathcal{L}_{type} + \beta_2 \mathcal{L}_{topic} + \mathcal{L}_{gen}$    (1)

where $\beta_1$ and $\beta_2$ are two hyper-parameters controlling the impact of the goal type and the goal topic. Under the supervision of goal planning in the training stage, the system learns to naturally generate coherent responses so as to achieve goal transitions during the inference stage.
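In code, Eq. (1) amounts to a weighted sum of three cross-entropy terms. The sketch below is PyTorch-style pseudocode with our own naming, purely for illustration; the actual system is implemented on top of the PaddlePaddle ERNIE-GEN codebase:

```python
import torch
import torch.nn.functional as F

def multitask_loss(type_logits, topic_logits, gen_logits,
                   type_label, topic_label, gold_token_ids,
                   beta1: float = 1.0, beta2: float = 1.0):
    """Overall loss of Eq. (1): L = b1*L_type + b2*L_topic + L_gen.
    The two goal heads are classification losses over the encoder
    output; generation is token-level cross-entropy."""
    loss_type = F.cross_entropy(type_logits, type_label)
    loss_topic = F.cross_entropy(topic_logits, topic_label)
    loss_gen = F.cross_entropy(
        gen_logits.reshape(-1, gen_logits.size(-1)),  # (batch*len, vocab)
        gold_token_ids.reshape(-1),
    )
    return beta1 * loss_type + beta2 * loss_topic + loss_gen
```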
knowledge. • Seq2Seq2 [26] is a generative baseline used in many • DuRecDial: It is composed of about 10k dialogues dialogue generation tasks. We concatenate dialogue and 156k utterances over multi-type domains, in- history, knowledge facts, and other sources (if any) to- cluding chit-chat, question answering (QA), and mu- gether as the input sequence and feed it to the vanilla sic/movie/news recommendation, etc. Each dialogue sequence-to-sequence (Seq2Seq) model with the atten- session consists of about 15 turns on average, with tion mechanism to generate responses. about 22 background knowledge triplets and a speci- fied user profile (e.g., age, gender, preference) in the • MGCG_R/G [15] include a retrieval-based model and format of ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ pairs. The goal sequence is con- a generation-based model for multi-goal driven con- structed upon the knowledge and user profiles, with versation generation. They are presented as the second each goal containing a goal type and a goal topic (en- baseline on the DuRecDial dataset. tity). There are altogether 21 goal types. • UniLM [27] is a unified pre-trained language model • Other Datasets: Since it is important to select appro- that can be used for language generation by controlling priate entities or phrases from background knowledge generation with specific self-attention masks. facts for a recommendation dialogue system, we also • GPT-2 [28] is an autoregressive pre-trained language utilize additional large-scale dialogue datasets to help model and has been successfully used in many down- fine-tune our system because of their similar settings stream language generation tasks. The pre-training on for incorporating knowledge in dialogue generation. large-scale text corpora makes it easy to be fine-tuned The datasets include ESTC [23], Tencent [24], and Kd- for dialogue generation. Conv [25]. Both ESTC and Tencent datasets are col- 3 lected from open-domain conversations, with about • GOKC [14] is a generation-based model with a goal- 900k and 5.5M dialogues respectively. The KdConv oriented knowledge discernment mechanism, which dataset covers conversations about movie, music, and discerns the knowledge facts that are highly correlated tourism, which has more than 3k dialogues. We will to the dialogue goal and the dialogue context. Note discuss the effect of model performance with and with- that GOKC is the publicly available state-of-the-art out using these datasets in Section 5.1. model on both the DuConv dataset and the DuRecDial dataset. 4.2. Data Preprocessing 4.4. Implementation Details To better understand the characteristics of different Our dialogue system is built on top of the official datasets, we conduct data analysis and preprocessing open-source code of ERNIE-GEN4 . During training (fine- first. The statistics of DuConv and DuRecDial datasets tuning), both 𝛽1 and 𝛽2 are set to 1.0 and the batch size are reported in Table 1. For the DuConv dataset, it has 2 https://opennmt.net/OpenNMT-py/. an average of 4.5 dialogue turns and an average of 14.2 3 https://github.com/jq2276/Learning2Copy knowledge triplets. After concatenating the multi-turn 4 https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-gen Table 1 The statistics of DuConv dataset and DuRecDial dataset Goals History Response KB triplets #Dialogue Avg. size Max. turn Avg. turn Max. length Avg. length Max. length Avg. length Max. size Avg. 
4.3. Baselines

We compare our system with baseline models and several competitive methods as follows.

• Seq2Seq [26] is a generative baseline used in many dialogue generation tasks (we use the OpenNMT implementation: https://opennmt.net/OpenNMT-py/). We concatenate the dialogue history, knowledge facts, and other sources (if any) together as the input sequence and feed it to the vanilla sequence-to-sequence (Seq2Seq) model with the attention mechanism to generate responses.

• MGCG_R/G [15] comprise a retrieval-based model and a generation-based model for multi-goal driven conversation generation, presented as baselines together with the DuRecDial dataset.

• UniLM [27] is a unified pre-trained language model that can be used for language generation by controlling the generation with specific self-attention masks.

• GPT-2 [28] is an autoregressive pre-trained language model that has been successfully used in many downstream language generation tasks. The pre-training on large-scale text corpora makes it easy to fine-tune for dialogue generation.

• GOKC [14] is a generation-based model with a goal-oriented knowledge discernment mechanism, which discerns the knowledge facts that are highly correlated with the dialogue goal and the dialogue context (code: https://github.com/jq2276/Learning2Copy). Note that GOKC is the publicly available state-of-the-art model on both the DuConv dataset and the DuRecDial dataset.

4.4. Implementation Details

Our dialogue system is built on top of the official open-source code of ERNIE-GEN (https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-gen). During training (fine-tuning), both $\beta_1$ and $\beta_2$ are set to 1.0 and the batch size is set to 8. We use the Adam [29] optimizer with an initial learning rate of $1 \times 10^{-4}$, an $L_2$ weight decay of 0.01, and learning rate warm-up over the first 10% of the training steps followed by linear decay. During generation, we adopt the beam search decoding algorithm with a beam size of 5. Further details are described below.

Fine-tuning. We start fine-tuning from the pre-trained Chinese version of the ERNIE 1.0 model [30], as it is compatible with the ERNIE-GEN framework and its pre-trained model checkpoint can be directly loaded. We first fine-tune our system on the 3 large-scale dialogue datasets (described in Section 4.1) for 5 epochs. Due to the large size of the ESTC and Tencent datasets, we randomly extract 400K dialogue samples from each original dataset. We then continue to fine-tune our system on the target dialogue datasets (DuConv and DuRecDial) for 10 epochs, with the bucket capacity $C$ set to 360.

Vocabulary Expansion. We find that ERNIE-GEN may generate unknown words (i.e., [UNK]), that is, words out of the vocabulary. Therefore, we expand the original vocabulary with additional high-occurrence tokens extracted from the datasets. The final vocabulary size is 18,000, which covers almost all the Chinese characters and common special tokens in the datasets.

Deduplication. We observe that our system sometimes tends to generate repeated words or phrases, a common issue that is still under exploration in natural language generation. To make the generated responses more fluent, we remove consecutively repeated words using regular expression rules.
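As an example of the kind of regular-expression rule this refers to (our own illustrative rule; the paper does not list its exact patterns), the sketch below collapses immediately repeated short chunks. Note that a real system would need to whitelist legitimate reduplications such as "谢谢":

```python
import re

def dedup(text: str) -> str:
    """Collapse consecutively repeated chunks of 1-5 characters,
    e.g. '很好听很好听' -> '很好听'. Applied repeatedly until the
    response stops changing."""
    pattern = re.compile(r"(.{1,5}?)\1+")
    prev = None
    while prev != text:
        prev, text = text, pattern.sub(r"\1", text)
    return text

print(dedup("我很喜欢喜欢这首歌这首歌"))  # -> 我很喜欢这首歌
```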
4.5. Evaluation Metrics

Automatic Evaluation. Following common practice [15, 14], we adopt the following automatic evaluation metrics.

• F1 score: It indicates whether the model can generate the appropriate entities in the response.

• BLEU-1/2 scores: They are also calculated at the character level, representing the 1-gram and 2-gram overlaps between the generated response and the gold response.

• Distinct (DIST)-1/2 scores: They are used to evaluate the 1-gram and 2-gram diversity of the generated responses.

• Perplexity (PPL): It is widely used to estimate how well a probability model predicts a sample. A low perplexity indicates the model is good at predicting the sample.

Human Evaluation. For human evaluation, we randomly select 100 dialogue samples from the test set, and invite 5 evaluators to independently assign rating scores to the output of each model, following the metrics suggested in [15]. The score of each metric ranges from 0 to 2. Furthermore, we also report the human evaluation in the Baidu LIC 2021, where crowd-sourcing annotators were invited to conduct about 10 multi-round conversations with each submitted system and to judge the dialogue quality. The metrics used in our evaluation and in the Baidu LIC 2021 are consistent, including:

• Informativeness (Info.): It measures whether the model makes full use of knowledge facts in the generated response.

• Coherence (Cohe.): It measures the overall fluency of the whole dialogue.

• Knowledge accuracy (Know Acc.): It evaluates the accuracy of the selected knowledge in the generated response.

• Recommendation success rate (Rec. Succ.): It estimates how well the target recommendation goal is achieved.
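For reference, here are sketches of a character-level F1 and the Distinct-n computation in the form commonly used for these Chinese benchmarks; the official evaluation scripts may differ in details such as tokenization:

```python
from collections import Counter

def char_f1(pred: str, gold: str) -> float:
    """Character-overlap F1 between prediction and gold response."""
    common = sum((Counter(pred) & Counter(gold)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)

def distinct_n(responses, n: int) -> float:
    """DIST-n: ratio of unique n-grams to total n-grams over all
    generated responses; higher means more diverse output."""
    ngrams = [r[i:i + n] for r in responses for i in range(len(r) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

print(distinct_n(["好的好的", "好的呀"], 2))  # -> 0.6
```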
5. Results and Analysis

5.1. Automatic Evaluation

The automatic evaluation results on the DuConv and DuRecDial datasets are reported in Table 2.

Table 2: The automatic evaluation results on the DuConv and DuRecDial datasets. "norm" and "ext" stand for normalized data and external dialogue datasets, respectively.

DuConv:
Model | F1 | BLEU-1/2 | DIST-1/2 | PPL
norm retrieval | 34.73 | 0.291/0.156 | 0.118/0.373 | N.A.
Seq2Seq | 39.94 | 0.283/0.186 | 0.093/0.222 | 10.96
norm generation | 41.84 | 0.347/0.198 | 0.057/0.155 | 24.30
UniLM | 41.10 | 0.326/0.213 | 0.089/0.241 | 10.05
GPT-2 | 41.05 | 0.349/0.223 | 0.095/0.250 | 8.72
GOKC | 45.09 | 0.410/0.272 | 0.105/0.272 | 9.92
Ours (w/o ext) | 45.15 | 0.443/0.361 | 0.097/0.276 | 7.81
Ours (w/ ext) | 45.72 | 0.450/0.369 | 0.106/0.285 | 6.49

DuRecDial:
Model | F1 | BLEU-1/2 | DIST-1/2 | PPL
Seq2Seq | 26.08 | 0.188/0.102 | 0.006/0.013 | 22.82
UniLM | 29.05 | 0.226/0.149 | 0.030/0.096 | 18.65
MGCG_R | 33.93 | 0.340/0.232 | 0.068/0.187 | N.A.
MGCG_G | 36.81 | 0.323/0.219 | 0.017/0.052 | 17.69
GPT-2 | 47.01 | 0.392/0.295 | 0.055/0.165 | 15.56
GOKC | 47.28 | 0.413/0.318 | 0.025/0.084 | 11.38
Ours (w/o ext) | 48.62 | 0.475/0.401 | 0.060/0.168 | 4.42
Ours (w/ ext) | 48.73 | 0.476/0.403 | 0.065/0.171 | 4.21
Ours (multi-task) | 48.80 | 0.479/0.408 | 0.064/0.166 | 4.22
Ours (full-goal) | 48.86 | 0.479/0.410 | 0.070/0.185 | 4.20

Our model outperforms all the compared models, and achieves a significant improvement on most of the evaluation metrics. Specifically, on the DuConv dataset, the normalized models (i.e., norm retrieval and norm generation) use normalized data in which the two specific goals in the knowledge path are replaced with "topic_a" and "topic_b" respectively, following [22]. As shown in Table 2, our model yields substantial improvements over the existing pre-trained models, including UniLM and GPT-2, on both F1 and BLEU-1/2. This demonstrates that our model can generate more coherent and informative responses at the n-gram level. Compared to the state-of-the-art model GOKC, our model without external dialogue datasets (w/o ext) still achieves about 0.13%, 8%, and 32% improvements in terms of F1, BLEU-1, and BLEU-2, respectively. With external dialogue datasets (w/ ext), our model further achieves 1.39%, 9.7%, and 35.6% improvements in F1, BLEU-1, and BLEU-2 over GOKC, which indicates that fine-tuning on large-scale, task-relevant dialogue datasets is effective for improving performance on the final target-guided knowledge-aware recommendation dialogue task. Note that the normalized retrieval method achieves the highest DIST-1/2 scores: retrieval-based methods that directly select responses from a list of candidates are more likely to retain the diversity of natural responses.

As shown in Table 2, our model also achieves superior performance over all baseline methods on the DuRecDial dataset. In particular, compared to the competitive model GOKC, our model without external dialogue datasets obtains about 2.8%, 15%, and 26.1% improvements on the F1, BLEU-1, and BLEU-2 metrics, and about 3%, 15.3%, and 26.7% improvements when using external dialogue datasets. Our model with multi-task learning further outperforms the baseline methods on all metrics, which demonstrates that the joint modeling of goal planning and dialogue generation effectively helps the system select appropriate knowledge from the background facts to facilitate generation. Besides, we also observe that the perplexity of our model is much lower, indicating that our model is more likely to generate fluent responses. It should be noted that, in its default setting, GOKC actually assumes that the full goal sequence is provided and thus does not require any goal planning [15]. Therefore, for a fair comparison, we also report our evaluation results using the available full goal sequence in Table 2 (the "Ours (full-goal)" row). The results further show the effectiveness of multi-task learning for our system. Overall, our model achieves significant improvements over the competitive methods in terms of all automatic evaluation metrics.
5.2. Human Evaluation

The human evaluation results of the baseline models and our model are presented in Table 3.

Table 3: The human evaluation results on the DuConv and DuRecDial datasets.

DuConv:
Model | Info. | Cohe. | Know Acc. | Rec. Succ.
norm retrieval | 0.605 | 0.767 | 0.677 | 0.71
norm generation | 0.881 | 0.819 | 0.904 | 0.85
UniLM | 0.909 | 1.022 | 0.921 | 0.93
GPT-2 | 0.911 | 1.216 | 0.934 | 0.94
GOKC | 1.052 | 1.308 | 1.108 | 1.12
Ours (w/o ext) | 1.060 | 1.436 | 1.127 | 1.26
Ours (w/ ext) | 1.081 | 1.456 | 1.213 | 1.31

DuRecDial:
Model | Info. | Cohe. | Know Acc. | Rec. Succ.
MGCG_R | 1.118 | 1.260 | 0.955 | 0.75
MGCG_G | 1.218 | 1.222 | 0.985 | 0.88
UniLM | 0.989 | 1.078 | 0.852 | 1.0
GPT-2 | 1.289 | 1.256 | 1.165 | 1.20
GOKC | 1.311 | 1.258 | 1.189 | 1.23
Ours (w/o ext) | 1.358 | 1.262 | 1.286 | 1.21
Ours (w/ ext) | 1.430 | 1.262 | 1.295 | 1.23
Ours (multi-task) | 1.433 | 1.268 | 1.315 | 1.32

As shown in Table 3, our model obtains the highest human scores on both the DuConv and DuRecDial datasets, which shows its effectiveness in generating informative and coherent responses with correct knowledge and consistent information. Specifically, our model achieves a significant improvement in terms of knowledge accuracy, which further verifies that fine-tuning on large-scale, task-relevant dialogue datasets is effective for improving the ability of our model to incorporate knowledge into generation. We observe that our model with multi-task learning obtains a much better recommendation success rate on the DuRecDial dataset. This verifies that our joint modeling of goal prediction and response generation enables the system to make more accurate recommendations with respect to the given goals and the user profile.

We submitted our best model to the Baidu LIC 2021 and achieved the 4th rank among 862 teams. The human evaluation results on the leaderboard are shown in Table 4. Note that the human evaluation here is more challenging due to two aspects: (1) the decision of the current goal relies on the previously predicted goals, and (2) the generation of the response at the current turn is further decided by the current goal. This is likely to cause error accumulation during multi-turn conversations, so the evaluation results can better reveal the abilities of different models to guide the conversation to the target. As shown in Table 4, our system is competitive compared to the others. However, our model performs inferior on the DuRecDial dataset in terms of the recommendation success rate, which encourages us to further improve our goal planning strategies in future work.

Table 4: The human evaluation results on the Baidu LIC 2021.

DuConv:
Rank | Info. | Cohe. | Know Acc. | Rec. Succ.
Team-1 | 1.023 | 1.469 | 1.215 | N.A.
Team-2 | 1.109 | 1.572 | 1.312 | N.A.
Team-3 | 1.198 | 1.649 | 1.412 | N.A.
Team-4 (Ours) | 1.081 | 1.456 | 1.213 | N.A.
Team-5 | 0.881 | 1.266 | 0.944 | N.A.
Team-6 | 1.088 | 1.153 | 1.176 | N.A.

DuRecDial:
Rank | Info. | Cohe. | Know Acc. | Rec. Succ.
Team-1 | 0.481 | 1.322 | 0.795 | 2.0
Team-2 | 0.426 | 1.09 | 0.641 | 1.633
Team-3 | 0.444 | 1.131 | 0.664 | 1.5
Team-4 (Ours) | 0.431 | 1.26 | 0.706 | 1.0
Team-5 | 0.48 | 1.246 | 0.754 | 1.467
Team-6 | 0.349 | 1.027 | 0.533 | 1.167
5.3. Discussion

Analysis of Implementation Details. We study the contribution of each part of our system by conducting experiments with several variants. The results are shown in Table 5. Our findings are as follows. (1) All strategies are effective in improving dialogue generation performance. (2) The knowledge-preserving encoding strategy contributes significantly to dialogue generation, especially on the DuRecDial dataset, where the input source sequence is much longer. Compared to previous methods that truncate tokens when the sequence exceeds the encoding length limitation of pre-trained models, our proposed encoding strategy better maintains the knowledge facts. (3) After expanding the vocabulary, our model achieves significant improvements on most of the metrics, which means that a vocabulary containing more common tokens is important for dialogue generation.

Table 5: The experimental results of different implementation details.

DuConv:
Our Model | F1 | BLEU-1/2 | DIST-1/2 | PPL
Base | 43.47 | 0.412/0.294 | 0.097/0.265 | 7.12
+ Deduplication | 43.60 | 0.415/0.298 | 0.099/0.269 | 7.12
+ Knowledge-preserving enc. | 44.23 | 0.422/0.336 | 0.104/0.276 | 6.98
+ Vocabulary expansion | 44.50 | 0.435/0.347 | 0.105/0.279 | 6.65
+ All | 45.72 | 0.450/0.369 | 0.106/0.285 | 6.49

DuRecDial:
Our Model | F1 | BLEU-1/2 | DIST-1/2 | PPL
Base | 45.97 | 0.447/0.389 | 0.053/0.159 | 4.55
+ Deduplication | 46.02 | 0.449/0.387 | 0.057/0.161 | 4.55
+ Knowledge-preserving enc. | 47.65 | 0.458/0.397 | 0.065/0.164 | 4.40
+ Vocabulary expansion | 48.27 | 0.463/0.405 | 0.067/0.179 | 4.34
+ All | 48.86 | 0.479/0.410 | 0.070/0.185 | 4.20

Future Research Direction. In the real world, recommending new target items that may attract users is meaningful, since users often have no definite preference for many unknown items. We are trying to achieve this objective through the development of the target-guided knowledge-aware recommendation dialogue system. We understand that simply modeling the target and the dialogue with multi-task learning, as investigated in this paper, is not sufficient. We leave the problem of proactively planning goals step by step towards the target goal as our future research direction.

6. Conclusion

In this paper, we explore target-guided knowledge-aware recommendation dialogue based on the pre-training fine-tuning framework, which aims to proactively lead the conversation and learn to make high-quality recommendations. We present a knowledge-preserving encoding strategy and a multi-task learning approach to enable our system to effectively recommend appropriate items and to generate fluent and coherent responses. The experimental results on two benchmark datasets demonstrate the effectiveness and superiority of our system compared to other competitive models in terms of both automatic and manual evaluations. We also discuss the implementation details and our future research direction.

Acknowledgments

The work described in this paper was supported by the Research Grants Council of Hong Kong (PolyU/15207920, PolyU/15207821), the National Natural Science Foundation of China (61672445, 62076212) and PolyU Internal Grants (ZVVX, ZG7H, ZVQ0).

References

[1] A. M. Turing, Computing machinery and intelligence, in: Parsing the Turing Test, 2009, pp. 23–65.

[2] W. Chen, J. Chen, P. Qin, X. Yan, W. Y. Wang, Semantically conditioned dialog response generation via hierarchical disentangled self-attention, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3696–3709.

[3] Y. Wu, F. Wei, S. Huang, Y. Wang, Z. Li, M. Zhou, Response generation by context-aware prototype editing, in: The Thirty-Third AAAI Conference on Artificial Intelligence, 2019, pp. 7281–7288.

[4] M. Eric, L. Krishnan, F. Charette, C. D. Manning, Key-value retrieval networks for task-oriented dialogue, in: Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, 2017, pp. 37–49.

[5] A. Madotto, C.-S. Wu, P. Fung, Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL) (Volume 1: Long Papers), 2018, pp. 1468–1478.

[6] C. Wu, R. Socher, C. Xiong, Global-to-local memory pointer networks for task-oriented dialogue, in: 7th International Conference on Learning Representations (ICLR), 2019.

[7] Q. Chen, J. Lin, Y. Zhang, M. Ding, Y. Cen, H. Yang, J. Tang, Towards knowledge-based recommender dialog system, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 1803–1813.

[8] D. Kang, A. Balakrishnan, P. Shah, P. Crook, Y.-L. Boureau, J. Weston, Recommendation as a communication game: Self-supervised bot-play for goal-oriented dialogue, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 1951–1961.

[9] D. Jannach, A. Manzoor, W. Cai, L. Chen, A survey on conversational recommender systems, ACM Computing Surveys (CSUR) 54 (2021) 1–36.

[10] H. Xu, S. Moon, H. Liu, B. Liu, P. Shah, P. S. Yu, User memory reasoning for conversational recommendation, in: Proceedings of the 28th International Conference on Computational Linguistics (COLING), 2020, pp. 5288–5308.

[11] W. Lei, X. He, Y. Miao, Q. Wu, R. Hong, M. Kan, T. Chua, Estimation-action-reflection: Towards deep interaction between conversational and recommender systems, in: The Thirteenth ACM International Conference on Web Search and Data Mining (WSDM), 2020, pp. 304–312.

[12] W. Lei, G. Zhang, X. He, Y. Miao, X. Wang, L. Chen, T. Chua, Interactive path reasoning on graph for conversational recommendation, in: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2020, pp. 2073–2083.

[13] W. Ma, R. Takanobu, M. Tu, M. Huang, Bridging the gap between conversational reasoning and interactive recommendation, arXiv preprint arXiv:2010.10333 (2020).

[14] J. Bai, Z. Yang, X. Liang, W. Wang, Z. Li, Learning to copy coherent knowledge for response generation, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2021, pp. 12535–12543.

[15] Z. Liu, H. Wang, Z.-Y. Niu, H. Wu, W. Che, T. Liu, Towards conversational recommendation over multi-type dialogs, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020, pp. 1036–1049.
[16] D. Xiao, H. Zhang, Y. Li, Y. Sun, H. Tian, H. Wu, H. Wang, ERNIE-GEN: An enhanced multi-flow pre-training and fine-tuning framework for natural language generation, in: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI), 2020, pp. 3997–4003.

[17] K. Christakopoulou, F. Radlinski, K. Hofmann, Towards conversational recommender systems, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 815–824.

[18] K. Zhou, W. X. Zhao, S. Bian, Y. Zhou, J. Wen, J. Yu, Improving conversational recommender systems via knowledge graph based semantic fusion, in: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2020, pp. 1006–1014.

[19] R. Sarkar, K. Goswami, M. Arcan, J. P. McCrae, Suggest me a movie for tonight: Leveraging knowledge graphs for conversational recommendation, in: Proceedings of the 28th International Conference on Computational Linguistics (COLING), 2020, pp. 4179–4189.

[20] S. A. Hayati, D. Kang, Q. Zhu, W. Shi, Z. Yu, INSPIRED: Toward sociable recommendation dialog systems, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 8142–8152.

[21] W. Cai, L. Chen, Predicting user intents and satisfaction with dialogue-based conversational recommendations, in: Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization, 2020, pp. 33–42.

[22] W. Wu, Z. Guo, X. Zhou, H. Wu, X. Zhang, R. Lian, H. Wang, Proactive human-machine conversation with explicit conversation goals, arXiv preprint arXiv:1906.05572 (2019).

[23] H. Zhou, M. Huang, T. Zhang, X. Zhu, B. Liu, Emotional chatting machine: Emotional conversation generation with internal and external memory, in: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 730–739.

[24] D. Cai, Y. Wang, W. Bi, Z. Tu, X. Liu, S. Shi, Retrieval-guided dialogue response generation via a matching-to-generation framework, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 1866–1875.

[25] H. Zhou, C. Zheng, K. Huang, M. Huang, X. Zhu, KdConv: A Chinese multi-domain dialogue dataset towards multi-turn knowledge-driven conversation, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020, pp. 7098–7108.

[26] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in: Advances in Neural Information Processing Systems 27 (NeurIPS), 2014, pp. 3104–3112.

[27] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, H. Hon, Unified language model pre-training for natural language understanding and generation, in: Advances in Neural Information Processing Systems 32 (NeurIPS), 2019, pp. 13042–13054.

[28] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI Blog 1 (2019) 9.

[29] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: 3rd International Conference on Learning Representations (ICLR), 2015.

[30] Y. Sun, S. Wang, Y. Li, S. Feng, X. Chen, H. Zhang, X. Tian, D. Zhu, H. Tian, H. Wu, ERNIE: Enhanced representation through knowledge integration, arXiv preprint arXiv:1904.09223 (2019).