=Paper=
{{Paper
|id=Vol-2960/paper7
|storemode=property
|title=Target-guided Knowledge-aware Recommendation Dialogue System: An Empirical Investigation (Long paper)
|pdfUrl=https://ceur-ws.org/Vol-2960/paper7.pdf
|volume=Vol-2960
|authors=Dongding Lin,Jian Wang,Wenjie Li
|dblpUrl=https://dblp.org/rec/conf/recsys/LinWL21
}}
==Target-guided Knowledge-aware Recommendation Dialogue System: An Empirical Investigation (Long paper)==
Target-guided Knowledge-aware Recommendation Dialogue System: An Empirical Investigation

Dongding Lin, Jian Wang and Wenjie Li
Department of Computing, The Hong Kong Polytechnic University
csdlin@comp.polyu.edu.hk (D. Lin); csjiwang@comp.polyu.edu.hk (J. Wang); cswjli@comp.polyu.edu.hk (W. Li)

3rd Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) & 5th Edition of Recommendation in Complex Environments (ComplexRec) Joint Workshop @ RecSys 2021, September 27 – October 1, 2021, Amsterdam, Netherlands. © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The target-guided recommendation dialogue system aims to make high-quality recommendations through interactive conversations proactively and naturally. Existing methods still struggle to incorporate background knowledge for coherent response generation, and to recommend appropriate items with respect to the dialogue context, user preference and recommendation target. In this paper, we investigate the problem of target-guided knowledge-aware recommendation dialogue and design a dialogue generation system to alleviate the above-mentioned issues. Specifically, we employ pre-trained language models with multi-task learning to jointly learn response generation and goal prediction towards the target. We also present a knowledge-preserving encoding strategy to maintain the facts in background knowledge. Extensive experiments on two benchmark datasets show that our system significantly outperforms various competitive models in terms of both automatic and manual evaluations. We further provide analysis and discussions to demonstrate that our system is effective in leveraging both related knowledge and planned goals to generate fluent, informative and coherent responses towards the target of recommendation.

Keywords: Recommendation Dialogue, Background Knowledge, Target Guiding, Multi-task Learning

1. Introduction

Building a human-like dialogue system is one of the long-cherished goals in natural language processing (NLP) [1]. Dialogue systems can be mainly used for chatting with users for entertainment, i.e., open-domain dialogues [2, 3], or for accomplishing specific tasks, i.e., task-oriented dialogues [4, 5, 6]. In recent years, recommendation dialogue systems [7, 8] have been recognized as an important special type of task-oriented dialogue system, with the aim of discovering user preferences and making recommendations through conversations. The growing research interest mainly comes from the benefit that dialogue provides an effective channel to handle the cold-start problem in recommendation, while recommendation-oriented tasks promote technological advances for dialogue systems [9].

Many existing methods have focused on various aspects of both recommendation and conversation, including user preference modeling [10], conversation strategy [11, 12], and dialogue generation [13, 14]. Today, incorporating knowledge graphs (KG) has been recognized as crucial to bridge the gap between recommendation reasoning and response generation [13]. Background knowledge helps a dialogue system better understand user interests and make recommendations via coherent natural language communication. Since users may not have a clear preference for the unfamiliar new items recommended, especially in many sociable recommendation domains such as music, movies and news, it is also important for a dialogue system to proactively lead the conversation to the recommendation target with high user engagement and enjoyment.

Recently, the emergence of the DuRecDial [15] dataset provided new insights towards the development of target-guided knowledge-aware recommendation dialogue systems. As the example in Figure 1 shows, the whole user-bot dialogue is grounded on a user profile, background knowledge, and a goal sequence. The bot needs to take both the user's interests and the knowledge graph into consideration to decide an optimal goal path that achieves the target of recommending and playing a piece of music (i.e., "Days of Friendship"). Here, the goal path is a sequence of goals, with each goal specifying a goal type (e.g., "Greeting" or "Movie Recommendation") and a goal topic (e.g., the movie "Orphans of the Zhao Family" or the music "Ode to Joy"). It is challenging to build such an effective recommendation dialogue system, as the planning of goals, which is required to ensure a natural and smooth conversation, must take into consideration both the user interests and the ultimate target goal. The joint modeling of dialogue context, user profile and recommendation target remains under-explored in existing work.

[Figure 1 shows a 22-turn user-bot conversation for the user "Ren Xinfeng", together with her user profile, the planned goal sequence (Greeting → Movie Recommendation → Chat about the Star → Music Recommendation → Play Music → Goodbye), and a knowledge graph centered on the actor Huang Xiaoming that grounds the recommended movie "Orphans of the Zhao Family" and the songs "Ode to Joy" and "Days of Friendship".]

Figure 1: An illustrative example of target-guided knowledge-aware recommendation dialogue from the DuRecDial dataset. The whole dialogue is grounded on a user profile, background knowledge, and a goal sequence, where the goal sequence is planned by the bot to achieve the target of recommending and playing a music.

In this paper, we investigate two key challenges in recommendation dialogue systems: (1) how to effectively incorporate related facts in background knowledge and the user profile into dialogue generation, and (2) how to make appropriate goal planning to proactively lead the conversation. To this end, we employ multi-task learning to jointly learn response generation and goal prediction towards the final target based on pre-trained language models. Specifically, we adopt ERNIE-GEN [16], an enhanced multi-flow pre-training and fine-tuning framework for natural language generation, as our backbone model. In addition, we also present a knowledge-preserving encoding strategy to maintain the background knowledge facts for dialogue generation. Extensive experiments on two benchmark datasets show that our system significantly outperforms various competitive models in terms of both automatic and manual evaluations. We submitted our best model to the Baidu Language and Intelligence Challenge 2021 (LIC 2021, https://aistudio.baidu.com/aistudio/competition/detail/67?isFromLuge=true), where we achieved the 4th rank among 862 teams. This reveals that our methods are effective in generating informative, coherent and appropriate responses and in achieving the target of recommendation.

Overall, our contributions are threefold:

(1) Towards building a target-guided recommendation dialogue system, we adopt multi-task learning to jointly model goal planning and dialogue generation based on pre-trained language models.

(2) We present a knowledge-preserving encoding strategy to better maintain background knowledge facts, in order to enhance the system's ability to generate appropriate responses by incorporating background knowledge.

(3) Evaluation results show that our system achieves significant improvements compared to various competitive models.
2. Related Work

The two research lines that motivate our study are conversational recommendation systems (CRS) and recommendation dialogue systems. We briefly introduce some representative works below.

2.1. Conversational Recommendation System

A conversational recommendation system (CRS) is a recommendation system (RS) that provides personalized recommendations through natural language conversations. Christakopoulou et al. [17] argued that asking questions benefits an RS, which can then better understand user preferences based on user feedback. To this end, they suggested moving from the traditional RS to the CRS. Lei et al. [11] proposed a three-stage framework called Estimation-Action-Reflection (EAR) to fill the interaction gap between conversation and recommendation in implicit ways. More explicitly, Lei et al. [12] treated conversational recommendation as interactively finding a path in a user-item-attribute graph. To enhance the semantic representations of products and the related textual descriptions of products, both Zhou et al. [18] and Sarkar et al. [19] incorporated external knowledge graphs (KG) into CRS, which in turn led to better recommendations. However, despite the improvement towards high-quality recommendations, these methods have limited abilities to generate natural and informative dialogues.

2.2. Recommendation Dialogue System

A recommendation dialogue system is a special type of task-oriented dialogue system, which is expected to encourage natural human-machine interaction with a clear target. To facilitate research along this line, several recommendation dialogue datasets have been released, including GoRecDial [8] and INSPIRED [20]. To further investigate whether a system can lead a multi-type dialogue to approach the target of recommendation with rich interaction behavior, Liu et al. [15] created a large-scale dialogue dataset, namely DuRecDial. Existing recommendation dialogue approaches mainly focus on how to effectively integrate interactive recommendation and dialogue generation. Cai et al. [21] contributed two hierarchical taxonomies for classifying user intents and recommendation actions. To bridge the gap between recommendation reasoning and response generation, Ma et al. [13] performed tree-structured reasoning on knowledge graphs, which can then be mapped to hierarchical dialogue acts to guide generation. More recently, Bai et al. [14] proposed a goal-oriented knowledge copy network to discern the knowledge facts that are highly correlated with the dialogue, which helps to generate accurate knowledge-aware responses.

In this paper, we aim to build a target-guided dialogue system for recommendation. It requires the system to make high-quality recommendations by considering external knowledge and user preference. More importantly, the system should also be able to lead the conversation towards the target goal naturally by generating appropriate responses.
3. Method

3.1. Problem Definition

Suppose a target-guided dialogue corpus is denoted as $D = \{(H_i, K_i, G_i, P_i, Y_i)\}_{i=1}^{N}$, where $H_i = \{h_{i,t}\}_{t=1}^{T}$ represents a dialogue history with multiple turns, and $K_i = \{k_{i,j}\}_{j=1}^{N_K}$ is the set of background knowledge facts that correspond to this conversation, with each element $k_{i,j}$ formulated as a triplet. $G_i = \{g_{i,j}\}_{j=1}^{L_G}$ is a goal sequence constructed upon the knowledge facts set $K_i$, where each goal $g_{i,j}$ consists of a goal type and a goal topic. $P_i = \{p_{i,j}\}_{j=1}^{N_P}$ represents a set of user profiles, with each profile $p_{i,j}$ in the format of a $\langle key, value \rangle$ pair. $Y_i = \{y_{i,j}\}_{j=1}^{L_Y}$ is the response produced on the basis of $H_i$, $K_i$, $G_i$, and $P_i$. Here, $L_G$ and $L_Y$ denote the sequence lengths of $G_i$ and $Y_i$, respectively.

Given the explicit goals $G' = \{g_1, g_{L_G}\}$ (i.e., the start goal and the target goal), a dialogue history $H'$ paired with the related knowledge facts $K'$, and the user profile $P'$, the objective of target-guided recommendation dialogue is to decide an appropriate goal $g_c$ at each turn, determining where the dialogue should go so as to proactively lead the dialogue from the start goal to the target goal, and meanwhile to generate a coherent and informative response that achieves the goal $g_c$.
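To make the notation concrete, here is a minimal sketch of what one corpus sample $(H_i, K_i, G_i, P_i, Y_i)$ could look like in code. This is our own illustration with hypothetical field names, not the serialization format actually used by DuConv or DuRecDial:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# One knowledge fact k_{i,j} is a (subject, relation, object) triplet,
# e.g. ("Huang Xiaoming", "sing", "Ode to Joy").
Triplet = Tuple[str, str, str]

@dataclass
class Goal:
    goal_type: str   # e.g. "Greeting", "Music Recommendation"
    goal_topic: str  # e.g. "Days of Friendship"

@dataclass
class DialogueSample:
    history: List[str]        # H_i: utterances h_{i,1..T}
    knowledge: List[Triplet]  # K_i: background knowledge facts
    goals: List[Goal]         # G_i: goal sequence g_{i,1..L_G}
    profile: Dict[str, str]   # P_i: <key, value> user profile pairs
    response: str             # Y_i: the gold response
```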
3.2. Model Architecture

Backbone Model. To tackle the issue of proactively planning goals for target-guided recommendation dialogue, we jointly model goal prediction and dialogue generation based on pre-trained language models, as shown in Figure 2 (a). Pre-trained language models have been widely used in dialogue generation on the basis of the pre-training fine-tuning framework, where different information sources such as knowledge facts and dialogue history are generally concatenated as the input, and responses are generated autoregressively. In this paper, we employ ERNIE-GEN [16], an enhanced multi-flow pre-training and fine-tuning framework for natural language generation, as our backbone framework. ERNIE-GEN bridges the discrepancy between training and inference with an infilling generation mechanism using multi-flow attention. In light of the fact that entities and phrases are organized span by span, ERNIE-GEN adopts span attention mask matrices (see Figure 2 (b)) to determine whether each token and each span can attend to each other. To better capture coherent semantic information of the context, a word-by-word flow and a span-by-span flow are integrated (see Figure 2 (c)), where the span-by-span generation flow aims to predict semantically complete spans consecutively. In view of the fact that specific entities or spans (e.g., musics, movies, and news) should be generated in the response as the recommended items, we believe ERNIE-GEN is a good choice with the advantages described above.

Figure 2: Illustration of the methodology. (a): Overview of our system, with goal type prediction, goal topic prediction, and response generation jointly modeled in a multi-task learning manner. (b): The span attention mask matrices used in our system. (c): Overview of the Multi-Flow Attention in ERNIE-GEN [16], which is employed as our backbone model.
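To give a feel for the span-level masking in Figure 2 (b), below is a deliberately simplified toy sketch (our own, not the exact ERNIE-GEN formulation): a block-causal mask in which every position sees the source context, and each target span additionally sees all earlier spans and itself, but never a future span:

```python
import numpy as np

def span_attention_mask(num_ctx: int, span_lens: list) -> np.ndarray:
    """Toy span-level causal mask (True = "can attend"): context
    tokens are mutually visible; each target span attends to the
    context, to all previous spans, and to itself, but never to a
    future span."""
    n = num_ctx + sum(span_lens)
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :num_ctx] = True  # every position sees the full context
    start = num_ctx
    for length in span_lens:
        end = start + length
        mask[start:end, num_ctx:end] = True  # up to and incl. itself
        start = end
    return mask

# 4 context tokens followed by target spans of lengths 2 and 3
print(span_attention_mask(4, [2, 3]).astype(int))
```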
Knowledge-preserving Encoding. It is difficult for existing pre-trained models to encode the concatenated background knowledge facts, because the concatenation often exceeds the encoding length limitation of the models. In particular, according to our statistics, the concatenated background knowledge facts of each dialogue in the DuRecDial [15] dataset contain more than 1,700 tokens on average, which substantially exceeds the encoding length limitation (i.e., 512) of many pre-trained language models, including ERNIE-GEN [16]. To address this issue, we present a knowledge-preserving encoding strategy to better maintain the background knowledge facts. First, all knowledge triplets $K_i$ of the $i$-th dialogue are sorted according to the token length of each triplet after concatenation. Then, we put these knowledge triplets into a number of buckets $\{B_{i,j}\}_{j=1}^{N_{B_i}}$ with short-length-first-in priority, where $B_{i,j} \subseteq K_i$ and $N_{B_i}$ denotes the number of buckets. The capacity of each bucket $B_{i,j}$ is tuned by a hyper-parameter $C$, meaning that $B_{i,j}$ contains no more than $C$ tokens in total. We expect that after concatenating $B_{i,j}$ with the other information sources, the total input length fulfills the encoding length limitation (i.e., 512). To this end, the $i$-th dialogue sample is split into $N_{B_i}$ dialogue samples. Note that with this strategy the system will generate multiple responses during inference, so we adopt a simple unsupervised strategy to select the "best" one. We calculate mutual F1 scores by treating one response as the "ground truth" and the others as candidate generated results. The average F1 score over the candidate results is regarded as the selection score of that "ground truth", so each response obtains a corresponding selection score. We select the response with the highest selection score as the final generated response.
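The following is a minimal sketch of how the bucketing and the mutual-F1 selection could be implemented as we read the description above. The greedy fill and the character-level length count are our assumptions, and `f1` stands for any pairwise F1 function:

```python
def bucket_knowledge(triplets, capacity):
    """Greedy knowledge-preserving bucketing: sort triplets by the
    token length of their concatenation (shortest first), then open
    a new bucket whenever adding the next triplet would push the
    bucket past `capacity` tokens. Each bucket yields one sample
    for the same dialogue."""
    def length(t):
        return len("".join(t))  # character count, as for Chinese text

    buckets, current, used = [], [], 0
    for t in sorted(triplets, key=length):
        if current and used + length(t) > capacity:
            buckets.append(current)
            current, used = [], 0
        current.append(t)
        used += length(t)
    if current:
        buckets.append(current)
    return buckets

def select_response(candidates, f1):
    """Unsupervised 'mutual F1' selection: score each candidate by
    its average F1 against all other candidates, keep the best."""
    def score(idx):
        others = [c for j, c in enumerate(candidates) if j != idx]
        return sum(f1(candidates[idx], c) for c in others) / max(len(others), 1)
    return candidates[max(range(len(candidates)), key=score)]
```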
Multi-task Learning. As described in Section 3.1, the system should generate a coherent and informative response following an appropriate goal, which is decided by the system itself at each turn with the aim of proactively leading the dialogue from the start goal to the target goal. Intuitively, the goal planning process has an important effect on dialogue generation. To this end, we propose to add goal prediction at each turn as an auxiliary task, which is jointly fine-tuned with the dialogue generation task in a multi-task learning manner. Concretely, we divide the task of goal prediction into two sub-tasks: goal type prediction and goal topic prediction. We feed the hidden representation of ERNIE-GEN's encoding output to two individual fully-connected feed-forward neural networks, each followed by a softmax operator, and both are optimized using the cross-entropy loss. As shown in Figure 2 (a), the fine-tuning objective during the training stage is to jointly optimize the goal type prediction loss $\mathcal{L}_{type}$, the goal topic prediction loss $\mathcal{L}_{topic}$, and the response generation loss $\mathcal{L}_{gen}$. We minimize the following overall loss:

$\mathcal{L} = \beta_1 \mathcal{L}_{type} + \beta_2 \mathcal{L}_{topic} + \mathcal{L}_{gen}$    (1)

where $\beta_1$ and $\beta_2$ are two hyper-parameters controlling the impact of the goal type and the goal topic. Under the supervision of goal planning in the training stage, the system learns to naturally generate coherent responses so as to achieve goal transitions during the inference stage.
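In code, Eq. (1) amounts to a weighted sum of three cross-entropy terms. The sketch below is PyTorch-style pseudocode with our own naming, purely for illustration; the actual system is implemented on top of the PaddlePaddle ERNIE-GEN codebase:

```python
import torch
import torch.nn.functional as F

def multitask_loss(type_logits, topic_logits, gen_logits,
                   type_label, topic_label, gold_token_ids,
                   beta1: float = 1.0, beta2: float = 1.0):
    """Overall loss of Eq. (1): L = b1*L_type + b2*L_topic + L_gen.
    The two goal heads are classification losses over the encoder
    output; generation is token-level cross-entropy."""
    loss_type = F.cross_entropy(type_logits, type_label)
    loss_topic = F.cross_entropy(topic_logits, topic_label)
    loss_gen = F.cross_entropy(
        gen_logits.reshape(-1, gen_logits.size(-1)),  # (batch*len, vocab)
        gold_token_ids.reshape(-1),
    )
    return beta1 * loss_type + beta2 * loss_topic + loss_gen
```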
knowledge. • Seq2Seq2 [26] is a generative baseline used in many • DuRecDial: It is composed of about 10k dialogues dialogue generation tasks. We concatenate dialogue and 156k utterances over multi-type domains, in- history, knowledge facts, and other sources (if any) to- cluding chit-chat, question answering (QA), and mu- gether as the input sequence and feed it to the vanilla sic/movie/news recommendation, etc. Each dialogue sequence-to-sequence (Seq2Seq) model with the atten- session consists of about 15 turns on average, with tion mechanism to generate responses. about 22 background knowledge triplets and a speci- fied user profile (e.g., age, gender, preference) in the • MGCG_R/G [15] include a retrieval-based model and format of ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ pairs. The goal sequence is con- a generation-based model for multi-goal driven con- structed upon the knowledge and user profiles, with versation generation. They are presented as the second each goal containing a goal type and a goal topic (en- baseline on the DuRecDial dataset. tity). There are altogether 21 goal types. • UniLM [27] is a unified pre-trained language model • Other Datasets: Since it is important to select appro- that can be used for language generation by controlling priate entities or phrases from background knowledge generation with specific self-attention masks. facts for a recommendation dialogue system, we also • GPT-2 [28] is an autoregressive pre-trained language utilize additional large-scale dialogue datasets to help model and has been successfully used in many down- fine-tune our system because of their similar settings stream language generation tasks. The pre-training on for incorporating knowledge in dialogue generation. large-scale text corpora makes it easy to be fine-tuned The datasets include ESTC [23], Tencent [24], and Kd- for dialogue generation. Conv [25]. Both ESTC and Tencent datasets are col- 3 lected from open-domain conversations, with about • GOKC [14] is a generation-based model with a goal- 900k and 5.5M dialogues respectively. The KdConv oriented knowledge discernment mechanism, which dataset covers conversations about movie, music, and discerns the knowledge facts that are highly correlated tourism, which has more than 3k dialogues. We will to the dialogue goal and the dialogue context. Note discuss the effect of model performance with and with- that GOKC is the publicly available state-of-the-art out using these datasets in Section 5.1. model on both the DuConv dataset and the DuRecDial dataset. 4.2. Data Preprocessing 4.4. Implementation Details To better understand the characteristics of different Our dialogue system is built on top of the official datasets, we conduct data analysis and preprocessing open-source code of ERNIE-GEN4 . During training (fine- first. The statistics of DuConv and DuRecDial datasets tuning), both 𝛽1 and 𝛽2 are set to 1.0 and the batch size are reported in Table 1. For the DuConv dataset, it has 2 https://opennmt.net/OpenNMT-py/. an average of 4.5 dialogue turns and an average of 14.2 3 https://github.com/jq2276/Learning2Copy knowledge triplets. After concatenating the multi-turn 4 https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-gen Table 1 The statistics of DuConv dataset and DuRecDial dataset Goals History Response KB triplets #Dialogue Avg. size Max. turn Avg. turn Max. length Avg. length Max. length Avg. length Max. size Avg. 
4.3. Baselines

We compare our system with baseline models and several competitive methods as follows.

• Seq2Seq [26] is a generative baseline used in many dialogue generation tasks (we use the OpenNMT implementation: https://opennmt.net/OpenNMT-py/). We concatenate the dialogue history, knowledge facts, and other sources (if any) together as the input sequence and feed it to the vanilla sequence-to-sequence (Seq2Seq) model with the attention mechanism to generate responses.

• MGCG_R/G [15] comprise a retrieval-based model and a generation-based model for multi-goal driven conversation generation, presented as baselines together with the DuRecDial dataset.

• UniLM [27] is a unified pre-trained language model that can be used for language generation by controlling the generation with specific self-attention masks.

• GPT-2 [28] is an autoregressive pre-trained language model that has been successfully used in many downstream language generation tasks. The pre-training on large-scale text corpora makes it easy to fine-tune for dialogue generation.

• GOKC [14] is a generation-based model with a goal-oriented knowledge discernment mechanism, which discerns the knowledge facts that are highly correlated with the dialogue goal and the dialogue context (code: https://github.com/jq2276/Learning2Copy). Note that GOKC is the publicly available state-of-the-art model on both the DuConv dataset and the DuRecDial dataset.

4.4. Implementation Details

Our dialogue system is built on top of the official open-source code of ERNIE-GEN (https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-gen). During training (fine-tuning), both $\beta_1$ and $\beta_2$ are set to 1.0 and the batch size is set to 8. We use the Adam [29] optimizer with an initial learning rate of $1 \times 10^{-4}$, an $L_2$ weight decay of 0.01, and learning rate warm-up over the first 10% of the training steps followed by linear decay. During generation, we adopt the beam search decoding algorithm with a beam size of 5. Further details are described below.

Fine-tuning. We start fine-tuning from the pre-trained Chinese version of the ERNIE 1.0 model [30], as it is compatible with the ERNIE-GEN framework and its pre-trained model checkpoint can be directly loaded. We first fine-tune our system on the 3 large-scale dialogue datasets (described in Section 4.1) for 5 epochs. Due to the large size of the ESTC and Tencent datasets, we randomly extract 400K dialogue samples from each original dataset. We then continue to fine-tune our system on the target dialogue datasets (DuConv and DuRecDial) for 10 epochs, with the bucket capacity $C$ set to 360.

Vocabulary Expansion. We find that ERNIE-GEN may generate unknown words (i.e., [UNK]), that is, words out of the vocabulary. Therefore, we expand the original vocabulary with additional high-occurrence tokens extracted from the datasets. The final vocabulary size is 18,000, which covers almost all the Chinese characters and common special tokens in the datasets.

Deduplication. We observe that our system sometimes tends to generate repeated words or phrases, a common issue that is still under exploration in natural language generation. To make the generated responses more fluent, we remove consecutively repeated words using regular expression rules.
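As an example of the kind of regular-expression rule this refers to (our own illustrative rule; the paper does not list its exact patterns), the sketch below collapses immediately repeated short chunks. Note that a real system would need to whitelist legitimate reduplications such as "谢谢":

```python
import re

def dedup(text: str) -> str:
    """Collapse consecutively repeated chunks of 1-5 characters,
    e.g. '很好听很好听' -> '很好听'. Applied repeatedly until the
    response stops changing."""
    pattern = re.compile(r"(.{1,5}?)\1+")
    prev = None
    while prev != text:
        prev, text = text, pattern.sub(r"\1", text)
    return text

print(dedup("我很喜欢喜欢这首歌这首歌"))  # -> 我很喜欢这首歌
```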
4.5. Evaluation Metrics

Automatic Evaluation. Following common practice [15, 14], we adopt the following automatic evaluation metrics.

• F1 score: It indicates whether the model can generate the appropriate entities in the response.

• BLEU-1/2 scores: They are also calculated at the character level, representing the 1-gram and 2-gram overlaps between the generated response and the gold response.

• Distinct (DIST)-1/2 scores: They are used to evaluate the 1-gram and 2-gram diversity of the generated responses.

• Perplexity (PPL): It is widely used to estimate how well a probability model predicts a sample. A low perplexity indicates the model is good at predicting the sample.

Human Evaluation. For human evaluation, we randomly select 100 dialogue samples from the test set, and invite 5 evaluators to independently assign rating scores to the output of each model, following the metrics suggested in [15]. The score of each metric ranges from 0 to 2. Furthermore, we also report the human evaluation in the Baidu LIC 2021, where crowd-sourcing annotators were invited to conduct about 10 multi-round conversations with each submitted system and to judge the dialogue quality. The metrics used in our evaluation and in the Baidu LIC 2021 are consistent, including:

• Informativeness (Info.): It measures whether the model makes full use of knowledge facts in the generated response.

• Coherence (Cohe.): It measures the overall fluency of the whole dialogue.

• Knowledge accuracy (Know Acc.): It evaluates the accuracy of the selected knowledge in the generated response.

• Recommendation success rate (Rec. Succ.): It estimates how well the target recommendation goal is achieved.
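For reference, here are sketches of a character-level F1 and the Distinct-n computation in the form commonly used for these Chinese benchmarks; the official evaluation scripts may differ in details such as tokenization:

```python
from collections import Counter

def char_f1(pred: str, gold: str) -> float:
    """Character-overlap F1 between prediction and gold response."""
    common = sum((Counter(pred) & Counter(gold)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)

def distinct_n(responses, n: int) -> float:
    """DIST-n: ratio of unique n-grams to total n-grams over all
    generated responses; higher means more diverse output."""
    ngrams = [r[i:i + n] for r in responses for i in range(len(r) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

print(distinct_n(["好的好的", "好的呀"], 2))  # -> 0.6
```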
5. Results and Analysis

5.1. Automatic Evaluation

The automatic evaluation results on the DuConv and DuRecDial datasets are reported in Table 2.

Table 2: The automatic evaluation results on the DuConv and DuRecDial datasets. "norm" and "ext" stand for normalized data and external dialogue datasets, respectively.

DuConv:
Model | F1 | BLEU-1/2 | DIST-1/2 | PPL
norm retrieval | 34.73 | 0.291/0.156 | 0.118/0.373 | N.A.
Seq2Seq | 39.94 | 0.283/0.186 | 0.093/0.222 | 10.96
norm generation | 41.84 | 0.347/0.198 | 0.057/0.155 | 24.30
UniLM | 41.10 | 0.326/0.213 | 0.089/0.241 | 10.05
GPT-2 | 41.05 | 0.349/0.223 | 0.095/0.250 | 8.72
GOKC | 45.09 | 0.410/0.272 | 0.105/0.272 | 9.92
Ours (w/o ext) | 45.15 | 0.443/0.361 | 0.097/0.276 | 7.81
Ours (w/ ext) | 45.72 | 0.450/0.369 | 0.106/0.285 | 6.49

DuRecDial:
Model | F1 | BLEU-1/2 | DIST-1/2 | PPL
Seq2Seq | 26.08 | 0.188/0.102 | 0.006/0.013 | 22.82
UniLM | 29.05 | 0.226/0.149 | 0.030/0.096 | 18.65
MGCG_R | 33.93 | 0.340/0.232 | 0.068/0.187 | N.A.
MGCG_G | 36.81 | 0.323/0.219 | 0.017/0.052 | 17.69
GPT-2 | 47.01 | 0.392/0.295 | 0.055/0.165 | 15.56
GOKC | 47.28 | 0.413/0.318 | 0.025/0.084 | 11.38
Ours (w/o ext) | 48.62 | 0.475/0.401 | 0.060/0.168 | 4.42
Ours (w/ ext) | 48.73 | 0.476/0.403 | 0.065/0.171 | 4.21
Ours (multi-task) | 48.80 | 0.479/0.408 | 0.064/0.166 | 4.22
Ours (full-goal) | 48.86 | 0.479/0.410 | 0.070/0.185 | 4.20

Our model outperforms all the compared models, and achieves a significant improvement on most of the evaluation metrics. Specifically, on the DuConv dataset, the normalized models (i.e., norm retrieval and norm generation) use normalized data in which the two specific goals in the knowledge path are replaced with "topic_a" and "topic_b" respectively, following [22]. As shown in Table 2, our model yields substantial improvements over the existing pre-trained models, including UniLM and GPT-2, on both F1 and BLEU-1/2. This demonstrates that our model can generate more coherent and informative responses at the n-gram level. Compared to the state-of-the-art model GOKC, our model without external dialogue datasets (w/o ext) still achieves about 0.13%, 8%, and 32% improvements in terms of F1, BLEU-1, and BLEU-2, respectively. With external dialogue datasets (w/ ext), our model further achieves 1.39%, 9.7%, and 35.6% improvements in F1, BLEU-1, and BLEU-2 over GOKC, which indicates that fine-tuning on large-scale, task-relevant dialogue datasets is effective for improving performance on the final target-guided knowledge-aware recommendation dialogue task. Note that the normalized retrieval method achieves the highest DIST-1/2 scores: retrieval-based methods that directly select responses from a list of candidates are more likely to retain the diversity of natural responses.

As shown in Table 2, our model also achieves superior performance over all baseline methods on the DuRecDial dataset. In particular, compared to the competitive model GOKC, our model without external dialogue datasets obtains about 2.8%, 15%, and 26.1% improvements on the F1, BLEU-1, and BLEU-2 metrics, and about 3%, 15.3%, and 26.7% improvements when using external dialogue datasets. Our model with multi-task learning further outperforms the baseline methods on all metrics, which demonstrates that the joint modeling of goal planning and dialogue generation effectively helps the system select appropriate knowledge from the background facts to facilitate generation. Besides, we also observe that the perplexity of our model is much lower, indicating that our model is more likely to generate fluent responses. It should be noted that, in its default setting, GOKC actually assumes that the full goal sequence is provided and thus does not require any goal planning [15]. Therefore, for a fair comparison, we also report our evaluation results using the available full goal sequence in Table 2 (the "Ours (full-goal)" row). The results further show the effectiveness of multi-task learning for our system. Overall, our model achieves significant improvements over the competitive methods in terms of all automatic evaluation metrics.
5.2. Human Evaluation

The human evaluation results of the baseline models and our model are presented in Table 3.

Table 3: The human evaluation results on the DuConv and DuRecDial datasets.

DuConv:
Model | Info. | Cohe. | Know Acc. | Rec. Succ.
norm retrieval | 0.605 | 0.767 | 0.677 | 0.71
norm generation | 0.881 | 0.819 | 0.904 | 0.85
UniLM | 0.909 | 1.022 | 0.921 | 0.93
GPT-2 | 0.911 | 1.216 | 0.934 | 0.94
GOKC | 1.052 | 1.308 | 1.108 | 1.12
Ours (w/o ext) | 1.060 | 1.436 | 1.127 | 1.26
Ours (w/ ext) | 1.081 | 1.456 | 1.213 | 1.31

DuRecDial:
Model | Info. | Cohe. | Know Acc. | Rec. Succ.
MGCG_R | 1.118 | 1.260 | 0.955 | 0.75
MGCG_G | 1.218 | 1.222 | 0.985 | 0.88
UniLM | 0.989 | 1.078 | 0.852 | 1.0
GPT-2 | 1.289 | 1.256 | 1.165 | 1.20
GOKC | 1.311 | 1.258 | 1.189 | 1.23
Ours (w/o ext) | 1.358 | 1.262 | 1.286 | 1.21
Ours (w/ ext) | 1.430 | 1.262 | 1.295 | 1.23
Ours (multi-task) | 1.433 | 1.268 | 1.315 | 1.32

As shown in Table 3, our model obtains the highest human scores on both the DuConv and DuRecDial datasets, which shows its effectiveness in generating informative and coherent responses with correct knowledge and consistent information. Specifically, our model achieves a significant improvement in terms of knowledge accuracy, which further verifies that fine-tuning on large-scale, task-relevant dialogue datasets is effective for improving the ability of our model to incorporate knowledge into generation. We observe that our model with multi-task learning obtains a much better recommendation success rate on the DuRecDial dataset. This verifies that our joint modeling of goal prediction and response generation enables the system to make more accurate recommendations with respect to the given goals and the user profile.

We submitted our best model to the Baidu LIC 2021 and achieved the 4th rank among 862 teams. The human evaluation results on the leaderboard are shown in Table 4. Note that the human evaluation here is more challenging due to two aspects: (1) the decision of the current goal relies on the previously predicted goals, and (2) the generation of the response at the current turn is further decided by the current goal. This is likely to cause error accumulation during multi-turn conversations, so the evaluation results can better reveal the abilities of different models to guide the conversation to the target. As shown in Table 4, our system is competitive compared to the others. However, our model performs inferior on the DuRecDial dataset in terms of the recommendation success rate, which encourages us to further improve our goal planning strategies in future work.

Table 4: The human evaluation results on the Baidu LIC 2021.

DuConv:
Rank | Info. | Cohe. | Know Acc. | Rec. Succ.
Team-1 | 1.023 | 1.469 | 1.215 | N.A.
Team-2 | 1.109 | 1.572 | 1.312 | N.A.
Team-3 | 1.198 | 1.649 | 1.412 | N.A.
Team-4 (Ours) | 1.081 | 1.456 | 1.213 | N.A.
Team-5 | 0.881 | 1.266 | 0.944 | N.A.
Team-6 | 1.088 | 1.153 | 1.176 | N.A.

DuRecDial:
Rank | Info. | Cohe. | Know Acc. | Rec. Succ.
Team-1 | 0.481 | 1.322 | 0.795 | 2.0
Team-2 | 0.426 | 1.09 | 0.641 | 1.633
Team-3 | 0.444 | 1.131 | 0.664 | 1.5
Team-4 (Ours) | 0.431 | 1.26 | 0.706 | 1.0
Team-5 | 0.48 | 1.246 | 0.754 | 1.467
Team-6 | 0.349 | 1.027 | 0.533 | 1.167
5.3. Discussion

Analysis of Implementation Details. We study the contribution of each part of our system by conducting experiments with several variants. The results are shown in Table 5. Our findings are as follows. (1) All strategies are effective in improving dialogue generation performance. (2) The knowledge-preserving encoding strategy contributes significantly to dialogue generation, especially on the DuRecDial dataset, where the input source sequence is much longer. Compared to previous methods that truncate tokens when the sequence exceeds the encoding length limitation of pre-trained models, our proposed encoding strategy better maintains the knowledge facts. (3) After expanding the vocabulary, our model achieves significant improvements on most of the metrics, which means that a vocabulary containing more common tokens is important for dialogue generation.

Table 5: The experimental results of different implementation details.

DuConv:
Our Model | F1 | BLEU-1/2 | DIST-1/2 | PPL
Base | 43.47 | 0.412/0.294 | 0.097/0.265 | 7.12
+ Deduplication | 43.60 | 0.415/0.298 | 0.099/0.269 | 7.12
+ Knowledge-preserving enc. | 44.23 | 0.422/0.336 | 0.104/0.276 | 6.98
+ Vocabulary expansion | 44.50 | 0.435/0.347 | 0.105/0.279 | 6.65
+ All | 45.72 | 0.450/0.369 | 0.106/0.285 | 6.49

DuRecDial:
Our Model | F1 | BLEU-1/2 | DIST-1/2 | PPL
Base | 45.97 | 0.447/0.389 | 0.053/0.159 | 4.55
+ Deduplication | 46.02 | 0.449/0.387 | 0.057/0.161 | 4.55
+ Knowledge-preserving enc. | 47.65 | 0.458/0.397 | 0.065/0.164 | 4.40
+ Vocabulary expansion | 48.27 | 0.463/0.405 | 0.067/0.179 | 4.34
+ All | 48.86 | 0.479/0.410 | 0.070/0.185 | 4.20

Future Research Direction. In the real world, recommending new target items that may attract users is meaningful, since users often have no definite preference for many unknown items. We are trying to achieve this objective through the development of the target-guided knowledge-aware recommendation dialogue system. We understand that simply modeling the target and the dialogue with multi-task learning, as investigated in this paper, is not sufficient. We leave the problem of proactively planning goals step by step towards the target goal as our future research direction.

6. Conclusion

In this paper, we explore target-guided knowledge-aware recommendation dialogue based on the pre-training fine-tuning framework, which aims to proactively lead the conversation and learn to make high-quality recommendations. We present a knowledge-preserving encoding strategy and a multi-task learning approach to enable our system to effectively recommend appropriate items and to generate fluent and coherent responses. The experimental results on two benchmark datasets demonstrate the effectiveness and superiority of our system compared to other competitive models in terms of both automatic and manual evaluations. We also discuss the implementation details and our future research direction.

Acknowledgments

The work described in this paper was supported by the Research Grants Council of Hong Kong (PolyU/15207920, PolyU/15207821), the National Natural Science Foundation of China (61672445, 62076212) and PolyU Internal Grants (ZVVX, ZG7H, ZVQ0).

References

[1] A. M. Turing, Computing machinery and intelligence, in: Parsing the Turing Test, 2009, pp. 23–65.

[2] W. Chen, J. Chen, P. Qin, X. Yan, W. Y. Wang, Semantically conditioned dialog response generation via hierarchical disentangled self-attention, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3696–3709.

[3] Y. Wu, F. Wei, S. Huang, Y. Wang, Z. Li, M. Zhou, Response generation by context-aware prototype editing, in: The Thirty-Third AAAI Conference on Artificial Intelligence, 2019, pp. 7281–7288.

[4] M. Eric, L. Krishnan, F. Charette, C. D. Manning, Key-value retrieval networks for task-oriented dialogue, in: Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, 2017, pp. 37–49.

[5] A. Madotto, C.-S. Wu, P. Fung, Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL) (Volume 1: Long Papers), 2018, pp. 1468–1478.

[6] C. Wu, R. Socher, C. Xiong, Global-to-local memory pointer networks for task-oriented dialogue, in: 7th International Conference on Learning Representations (ICLR), 2019.

[7] Q. Chen, J. Lin, Y. Zhang, M. Ding, Y. Cen, H. Yang, J. Tang, Towards knowledge-based recommender dialog system, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 1803–1813.

[8] D. Kang, A. Balakrishnan, P. Shah, P. Crook, Y.-L. Boureau, J. Weston, Recommendation as a communication game: Self-supervised bot-play for goal-oriented dialogue, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 1951–1961.

[9] D. Jannach, A. Manzoor, W. Cai, L. Chen, A survey on conversational recommender systems, ACM Computing Surveys (CSUR) 54 (2021) 1–36.

[10] H. Xu, S. Moon, H. Liu, B. Liu, P. Shah, P. S. Yu, User memory reasoning for conversational recommendation, in: Proceedings of the 28th International Conference on Computational Linguistics (COLING), 2020, pp. 5288–5308.

[11] W. Lei, X. He, Y. Miao, Q. Wu, R. Hong, M. Kan, T. Chua, Estimation-action-reflection: Towards deep interaction between conversational and recommender systems, in: The Thirteenth ACM International Conference on Web Search and Data Mining (WSDM), 2020, pp. 304–312.

[12] W. Lei, G. Zhang, X. He, Y. Miao, X. Wang, L. Chen, T. Chua, Interactive path reasoning on graph for conversational recommendation, in: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2020, pp. 2073–2083.

[13] W. Ma, R. Takanobu, M. Tu, M. Huang, Bridging the gap between conversational reasoning and interactive recommendation, arXiv preprint arXiv:2010.10333 (2020).

[14] J. Bai, Z. Yang, X. Liang, W. Wang, Z. Li, Learning to copy coherent knowledge for response generation, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2021, pp. 12535–12543.

[15] Z. Liu, H. Wang, Z.-Y. Niu, H. Wu, W. Che, T. Liu, Towards conversational recommendation over multi-type dialogs, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020, pp. 1036–1049.
[16] D. Xiao, H. Zhang, Y. Li, Y. Sun, H. Tian, H. Wu, H. Wang, ERNIE-GEN: An enhanced multi-flow pre-training and fine-tuning framework for natural language generation, in: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI), 2020, pp. 3997–4003.

[17] K. Christakopoulou, F. Radlinski, K. Hofmann, Towards conversational recommender systems, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 815–824.

[18] K. Zhou, W. X. Zhao, S. Bian, Y. Zhou, J. Wen, J. Yu, Improving conversational recommender systems via knowledge graph based semantic fusion, in: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2020, pp. 1006–1014.

[19] R. Sarkar, K. Goswami, M. Arcan, J. P. McCrae, Suggest me a movie for tonight: Leveraging knowledge graphs for conversational recommendation, in: Proceedings of the 28th International Conference on Computational Linguistics (COLING), 2020, pp. 4179–4189.

[20] S. A. Hayati, D. Kang, Q. Zhu, W. Shi, Z. Yu, INSPIRED: Toward sociable recommendation dialog systems, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 8142–8152.

[21] W. Cai, L. Chen, Predicting user intents and satisfaction with dialogue-based conversational recommendations, in: Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization, 2020, pp. 33–42.

[22] W. Wu, Z. Guo, X. Zhou, H. Wu, X. Zhang, R. Lian, H. Wang, Proactive human-machine conversation with explicit conversation goals, arXiv preprint arXiv:1906.05572 (2019).

[23] H. Zhou, M. Huang, T. Zhang, X. Zhu, B. Liu, Emotional chatting machine: Emotional conversation generation with internal and external memory, in: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 730–739.

[24] D. Cai, Y. Wang, W. Bi, Z. Tu, X. Liu, S. Shi, Retrieval-guided dialogue response generation via a matching-to-generation framework, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 1866–1875.

[25] H. Zhou, C. Zheng, K. Huang, M. Huang, X. Zhu, KdConv: A Chinese multi-domain dialogue dataset towards multi-turn knowledge-driven conversation, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020, pp. 7098–7108.

[26] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in: Advances in Neural Information Processing Systems 27 (NeurIPS), 2014, pp. 3104–3112.

[27] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, H. Hon, Unified language model pre-training for natural language understanding and generation, in: Advances in Neural Information Processing Systems 32 (NeurIPS), 2019, pp. 13042–13054.

[28] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI Blog 1 (2019) 9.

[29] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: 3rd International Conference on Learning Representations (ICLR), 2015.

[30] Y. Sun, S. Wang, Y. Li, S. Feng, X. Chen, H. Zhang, X. Tian, D. Zhu, H. Tian, H. Wu, ERNIE: Enhanced representation through knowledge integration, arXiv preprint arXiv:1904.09223 (2019).