End-to-End Learning for Conversational Recommendation: A Long Way to Go?

Dietmar Jannach, University of Klagenfurt, Austria, dietmar.jannach@aau.at
Ahtsham Manzoor, University of Klagenfurt, Austria, ahtsham.manzoor@aau.at

ABSTRACT
Conversational Recommender Systems (CRS) have received increased interest in recent years due to advances in natural language processing and the wider use of voice-controlled smart assistants. One technical approach to build such systems is to learn, in an end-to-end way, from recorded dialogs between humans. Recent proposals rely on neural architectures for learning such models. These models are often evaluated both with the help of computational metrics and with the help of human annotators. In the latter case, the task of the human judges may consist of assessing the utterances generated by the models, e.g., in terms of their consistency with previous dialog utterances.
However, such assessments may not tell us enough about the true usefulness of the resulting recommendation model, in particular when the judges only assess how good one model is compared to another. In this work, we therefore analyze the utterances generated by two recent end-to-end learning approaches for CRS on an absolute scale. Our initial analysis reveals that for each system about one third of the system utterances are not meaningful in the given context and would probably lead to a broken conversation. Furthermore, less than two thirds of the recommendations were considered to be meaningful. Interestingly, neither of the two systems truly "generated" utterances, as almost all system responses were already present in the training data. Overall, our work shows that (i) current approaches that are published at high-quality research outlets may have severe limitations regarding their usability in practice and (ii) our academic evaluation approaches for CRS should be reconsidered.

CCS CONCEPTS
• Information systems → Recommender systems

KEYWORDS
Conversational Recommender Systems; Evaluation.

ACM Reference Format:
Dietmar Jannach and Ahtsham Manzoor. 2020. End-to-End Learning for Conversational Recommendation: A Long Way to Go?. In IntRS Workshop at ACM RecSys '20, September 2020, Online. ACM, New York, NY, USA, 5 pages.

1 INTRODUCTION
In Conversational Recommender Systems (CRS), a software agent interacts with users in a multi-turn dialog with the goal of supporting them in finding items that match their preferences [9]. From an interaction perspective, such systems therefore go beyond the one-shot recommendation paradigm of typical recommenders that can be found, e.g., on e-commerce sites, as they try to elicit the user's specific preferences in an interactive conversation. Such CRS have been a focus of researchers for more than two decades now, starting with early critiquing approaches in the mid-1990s [7]. Since then, a number of alternative technical approaches have been proposed, ranging from more elaborate critiquing techniques [3], over knowledge-based advisory systems [8], to learning-based systems that rely, e.g., on reinforcement learning techniques [5, 15].
In recent years, CRS have gained increased research interest again, mostly due to technological progress in natural language processing (NLP) and the widespread use of voice-controlled smart assistants such as Apple's Siri, Amazon's Alexa, or Google Home. Differently from many previous works that are based on substantial amounts of statically defined domain knowledge (e.g., about item properties or pre-defined dialog states and transitions), some recent approaches try to adopt an end-to-end learning approach [4, 11-13]. Informally speaking, the main task in such an approach is to learn a machine learning model given a set of recommendation dialogs that were held between humans. One promise of such solutions is that the amount of knowledge engineering can be kept low and that such a system should continuously improve as more dialogs become available.
The evaluation of the usefulness of such CRS in academic environments is, however, generally challenging. To assess the quality of a system, it is not only important to check if the recommendations are adequate in a given dialog situation; one also has to assess the quality of the dialog itself. Dialog quality could, for example, relate to the question of whether a system is able to react to chit-chat utterances (phatic expressions) in an appropriate way.
In some recent works, researchers use a combination of objective and subjective measures to assess a CRS. Objective measures can, for example, be typical recommendation accuracy measures, but may also include linguistic measures like perplexity that capture the fluency of the natural language [4]. Subjective evaluations sometimes include judgments by independent human evaluators. In [4], for example, the task of the evaluators was to rate the consistency of a system-generated utterance at a given dialog state on an absolute scale. In [12], a ranking of different alternatives was requested from the annotators. Such a form of human evaluation, however, has some limitations. In the case of relative comparisons, we do not know if any of the compared systems are useful at all. In the case of an absolute scale, the KBRD system from [4], for example, reached a consistency score of about 2 on a scale from one to three. This, however, cannot fully inform us about how useful such systems are in practice. In a deployed application, users might, for example, not tolerate too many conversation breakdowns, where the system is incapable of responding in a reasonable way.
In this work, we therefore analyze quality aspects of two state-of-the-art end-to-end learning systems ([4, 12]) in a complementary way. Specifically, we ask human evaluators to assess, using a binary scale, whether a given response by the system appears meaningful to them or not. Examples of non-meaningful utterances would be a repetition of what was previously said or a system utterance that does not match the context of the dialog. In addition, we asked the evaluators to judge every item recommendation in a subjective way.
Our analysis shows that about one third of the utterances generated by the investigated systems are considered not meaningful in the given dialog context. Moreover, for both systems, more than one third of the recommendations did not suit the assumed preferences of the recommendation seeker. These observations raise questions both regarding the practical usefulness of the proposed systems and the way we evaluate CRS in academia. A main implication of our analyses is that more realistic ways of evaluating CRS are needed. In particular, such evaluation approaches should help us understand whether a proposed end-to-end learning system (i) reaches a quality level, in terms of generating plausible responses, that is actually acceptable for users and (ii) is able to avoid bad recommendations that can be detrimental to the system's quality perception and use [2].

2 ANALYZED APPROACHES
Technical Approaches – DeepCRS and KBRD. We analyzed two recent approaches from the literature. The first one, which we denote as DeepCRS, was published at NeurIPS 2018 [12]. Its architecture consists of four sub-components, which accomplish different tasks such as sentence encoding, next-utterance prediction, sentiment classification, and recommendation. Technically, the architecture is inspired by the hierarchical HRED architecture from [16] and is based on RNNs and an autoencoder for the recommendation task. The second approach is called KBRD (Knowledge-Based Recommender Dialog System) [4] and was published at EMNLP-IJCNLP '19. The system's components include a Transformer-based sequence-to-sequence module as a dialog system, a knowledge graph that captures dialog-external knowledge about the domain (movies), and a switching network that connects the dialog and knowledge modules.

Underlying Data – The ReDial Dataset. Both approaches, DeepCRS and KBRD, are trained on the ReDial dataset (https://redialdata.github.io/website/). This dataset was collected by the authors of DeepCRS with the help of crowdworkers and consists of more than 10,000 conversations between a recommendation seeker and a recommender. The crowdworkers were given specific instructions regarding the conversation: For example, each participant had to take one of two roles, seeker or recommender. The seeker had to specify which movies she likes, and the recommender's task was to make recommendations based on the assumed interests of the seeker. The conversations were collected through a web-based interface, where the crowdworkers typed their utterances in natural language. At least four movie mentions were required per dialog, and each dialog had to have at least ten utterances. The resulting conversations were then further enriched. Movie mentions were tagged with movie names and release years. Furthermore, different labels were assigned to the movies, e.g., whether or not a dialog seeker has seen or liked them.
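To illustrate the kind of training signal such dialogs provide, the following is a simplified, hypothetical sketch of how a single ReDial-style conversation could be represented. The field names and the @-prefixed movie identifiers are our own illustration and do not reflect the dataset's actual schema.

```python
# Hypothetical, simplified representation of one ReDial-style dialog.
# Field names are illustrative only; the real dataset uses its own schema.
example_dialog = {
    "messages": [
        {"role": "seeker", "text": "I loved @111776 (The Martian). Any similar suggestions?"},
        {"role": "recommender", "text": "You might enjoy @204334 (Interstellar)."},
        {"role": "seeker", "text": "Great, I have not seen that one yet. Thanks!"},
    ],
    # Movie mentions are tagged with names and release years.
    "movie_mentions": {
        "@111776": {"name": "The Martian", "year": 2015},
        "@204334": {"name": "Interstellar", "year": 2014},
    },
    # Per-movie labels collected from the crowdworkers, e.g., seen/liked flags.
    "seeker_labels": {
        "@111776": {"seen": True, "liked": True},
        "@204334": {"seen": False, "liked": None},
    },
}
```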
Original Evaluation. The DeepCRS system was evaluated in different dimensions, including the quality of the sentiment classification, recommendation quality, and overall dialog assessment. Accuracy was measured in terms of Cohen's kappa coefficient [6] given the like/dislike labels in the ReDial dataset, which, however, has a very skewed distribution with over 90% like statements.
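As a reminder of what this metric measures, Cohen's kappa corrects the observed agreement between two label assignments for the agreement expected by chance, which is exactly why it is more informative than plain accuracy on such a skewed label distribution:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

Here, $p_o$ is the observed proportion of agreement (between predicted and ground-truth like/dislike labels) and $p_e$ is the agreement expected by chance from the marginal label frequencies. With over 90% "like" statements, $p_e$ is already high, so a classifier must clearly beat the majority-label baseline to obtain a substantial $\kappa$.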
The more interesting part was the human evaluation. Here, ten participants of a user study were given ten dialogs from the ReDial dataset that contained 56 system-generated utterances. The task of the participants was to rank each utterance compared to (a) the true utterance in the original dialog and (b) the utterance generated by the HRED model [16], which was used as a baseline. The results of the ranking exercise showed that the human recommendations were most often considered to be the best ones and that the proposed DeepCRS model was better than HRED. An example of different responses in a given situation is shown in Table 1.

Table 1: Example of Conversation with Alternative Recommendation Utterances used for Evaluation [12].

  ...
  SEEKER:  2001: a space odyssey might be a great option. anything else that you would suggest?
  HUMAN:   you can try planet of the apes the older one is quite suspenseful and family friendly.
  HRED:    i haven't seen that one yet but i heard it was good. i haven't seen that one. have you seen the last house on the left?
  DeepCRS: star wars: the force awakens is also a good one return of the jedi all good movies

The KBRD approach was evaluated in three dimensions. Two computational metrics, perplexity and distinct n-grams, measure the fluency and diversity of the natural language. Recommendation quality was measured in terms of Recall. KBRD proved to be favorable over two baselines (DeepCRS and a Transformer model) in terms of all computational metrics. For the human evaluation, ten annotators with knowledge in linguistics were asked to assess the consistency of a generated utterance with the previous dialog history on a scale from 1 to 3. An average consistency rating of 1.99 was obtained, which was about 15% higher than the average rating for the DeepCRS baseline.
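For reference, these two computational metrics are commonly defined as follows; we sketch them here in their standard form, which may differ in detail from the exact variants used in [4]:

```latex
\mathrm{PPL} = \exp\Big(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid w_{<i})\Big),
\qquad
\mathrm{distinct}\text{-}n = \frac{\#\ \text{unique } n\text{-grams in the generated responses}}{\#\ \text{generated } n\text{-grams}}
```

Lower perplexity indicates that the model assigns higher probability to the observed word sequences (fluency), while higher distinct-n values indicate a more diverse set of generated n-grams.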
Overall, the (relative) ranking exercise for DeepCRS unfortunately does not tell us much about the absolute meaningfulness and usefulness of the system-generated utterances. For KBRD, the evaluation was done on an absolute scale. The average score was at about 2 on the 1-3 scale, i.e., in the middle. No details are, however, provided regarding the distribution of the ratings. It is, for example, not clear if trivial system responses like a goodbye to a seeker's goodbye were counted as consistent dialog continuations.
Given the difficulty of assessing the usefulness of the proposed systems from the reported studies, our goal was to assess the quality of the system responses through a complementary analysis.

3 ANALYSIS METHODOLOGY
Looking at example conversations published in the supplementary material of [12] (https://papers.nips.cc/paper/8180-towards-deep-conversational-recommendations), we found that even in these hand-selected examples many system responses (labeled as OURS) were not meaningful. One main goal of our analysis was to quantify the extent of such problems. Our analysis procedure was as follows.
First, we randomly selected 70 dialogs from the ReDial test dataset. We then used the code provided by the authors of DeepCRS and KBRD to reproduce the systems and to generate system responses after each seeker utterance, given the dialog history up to that point. As a result, we obtained 70 dialogs, which not only contained the original seeker and human recommender utterances, but also the recommender sentences that were generated by the respective CRS.
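The following is a minimal sketch of this response-collection procedure. The model wrappers (`DeepCRSModel`, `KBRDModel`) and the dialog fields are hypothetical placeholders for the authors' published code and data loaders, not real APIs.

```python
def collect_system_responses(model, dialogs):
    """For every seeker utterance, generate a system response given the dialog
    history up to that point, and keep it next to the original human turns."""
    augmented = []
    for dialog in dialogs:
        history, turns = [], []
        for turn in dialog["messages"]:
            turns.append(turn)
            history.append(turn["text"])
            if turn["role"] == "seeker":
                # The model only sees the dialog up to the current seeker utterance.
                system_response = model.respond(history)
                turns.append({"role": "system", "text": system_response})
        augmented.append({"messages": turns})
    return augmented

# Usage sketch (hypothetical wrappers around the published code):
#   dialogs = random.sample(load_redial_test_dialogs(), 70)
#   deepcrs_dialogs = collect_system_responses(DeepCRSModel(), dialogs)
#   kbrd_dialogs = collect_system_responses(KBRDModel(), dialogs)
```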
In total, 758 system responses, 399 by DeepCRS and 359 by KBRD, were generated this way. We analyzed the responses through both manual and automated processes. For replicability, we share all study materials online (https://drive.google.com/drive/folders/10gPOmaiFrZjIULIa3LsdmuyvJvnCV_Xq). The following main analyses were made:
(1) Creativeness or novelty of the responses with respect to the training data;
(2) Meaningfulness of the system responses in the given context;
(3) Quality of the recommendations.
In the context of analysis (1), we were wondering how different the system-generated responses are from the training data. This is particularly relevant as the authors of KBRD measure perplexity and n-gram distance for the generated sentences. In our analysis, we therefore counted which fraction of the system responses was contained in an identical or almost identical form in the training data. We considered sentences to be almost identical if the same set of words appeared in them with the same frequency, i.e., only the word order was changed. In case the generated sentences are mostly identical to sentences appearing in the training data, measuring the perplexity and the n-gram distance of what are mostly genuine human utterances is not very informative. A simple check for such near-duplicates is sketched below.
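The following is a minimal sketch of how this identical / almost-identical classification can be implemented, assuming the training utterances and generated responses are available as plain strings; tokenization details (e.g., handling of movie-name placeholders) are simplified.

```python
from collections import Counter

def normalize(sentence):
    """Lowercase and tokenize on whitespace; real preprocessing would also strip
    punctuation and replace movie mentions with a common placeholder token."""
    return sentence.lower().split()

def classify_response(response, training_sentences):
    """Classify a generated response as 'identical', 'almost identical'
    (same word multiset, different order), or 'new' wrt. the training data."""
    tokens = normalize(response)
    bag = Counter(tokens)
    train_token_lists = [normalize(s) for s in training_sentences]
    if any(tokens == t for t in train_token_lists):
        return "identical"
    if any(bag == Counter(t) for t in train_token_lists):
        return "almost identical"
    return "new"

# Example: word order differs, but word frequencies are the same.
print(classify_response(
    "have you seen the movie ?",
    ["you have seen the movie ?", "i love horror movies"],
))  # -> 'almost identical'
```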
To measure aspects (2) and (3), we relied on human annotators who marked each generated system response in the 70 dialogs as being meaningful or not. Furthermore, we asked them to label each utterance as being chit-chat or containing a recommendation. Two of the three annotators were PhD students at two universities with no background in conversational recommendation. One of them evaluated the DeepCRS responses, the other annotated the KBRD responses. They were not informed about the background of their task. To obtain a second opinion and to avoid potential biases, both sets of responses were also manually labeled by one of the authors of this paper. The annotator agreement was generally very high (92.73% for DeepCRS and 93.89% for KBRD).
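These values presumably correspond to simple percent agreement over the binary meaningfulness labels; a minimal sketch of that computation (with made-up label lists) is shown below. Chance-corrected measures such as Cohen's kappa (see Section 2) could be reported in addition.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators assigned the same binary label."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)

# Toy example with invented labels (1 = meaningful, 0 = not meaningful).
external_annotator = [1, 1, 0, 1, 0, 1, 1, 0]
author_annotator   = [1, 1, 0, 0, 0, 1, 1, 0]
print(f"{percent_agreement(external_annotator, author_annotator):.2f}%")  # 87.50%
```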
When instructing the external annotators, we did not provide a specific definition of what "meaningful" means. Our analysis shows that typical non-meaningful sentences include situations where the system ignores the intent of the last user utterance, repeats questions, abruptly ends the conversation, provides a broken or incomplete response, or makes a bad recommendation. The judgment of what represents a bad recommendation is in many cases clear, e.g., when the system recommends a movie that the seeker has just mentioned, but to some extent it remains a subjective assessment.

4 RESULTS
Analysis of Generated Sentences. Table 2 shows the characteristics of the utterances that were generated by DeepCRS and KBRD. Even though both algorithms were fed with the same dialogs to complete, the number of generated sentences for KBRD is lower, as this method did not always return a response.

Table 2: Characteristics of Generated Sentences

                                      DeepCRS   KBRD
  Generated Sentences                     399    359
  Unique Sentences                         46    159
  Identical in Training Data               44     87
  Almost Identical in Training Data         2     59
  New Sentences                             0      5
  Broken Sentences                          0     11

Regarding the novelty of the returned sentences, we found, to some surprise, that DeepCRS almost exclusively returns sentences that appear in identical form in the training data (except for the placeholders for movie names, which are eventually replaced by the algorithm). KBRD also mostly returns sentences that are contained in the training data or are tiny modifications of such sentences. Five of the generated sentences were not in the training data. However, there were also 11 generated sentences that were broken.
Overall, it is surprising that both systems mostly return sentences they found in the training data, which resembles a retrieval approach more than a language generation approach. Measuring linguistic properties, e.g., perplexity on the sentence level, of what are genuinely human sentences is therefore not too meaningful.

Analysis of Dialog and Recommendation Quality. In Table 3, we show the results of the labeling process by the annotators. The numbers in the table correspond to the rounded average of the two annotators who, as mentioned above, have a very high agreement.

Table 3: Analysis of Dialog and Recommendation Quality

                                              DeepCRS        KBRD
  Number of dialogs                                70          70
  Generated sentences (overall)                   399         359
  Sentences labeled as meaningful           277 (69%)   209 (58%)
  Sentences labeled as not meaningful       122 (31%)   150 (42%)
  Dialogs without problems                          5           5
  Chit-chat sentences                             132          88
  Chit-chat labeled as meaningful           112 (85%)    77 (87%)
  Number of recommendations                       106         119
  Recommendations labeled as meaningful      63 (60%)    66 (55%)
  Dialogs with no meaningful recommendation  25 (36%)    20 (28%)
  Dialogs with no recommendation made         7 (10%)    6 (8.5%)

The results show that for both systems a substantial fraction of the generated responses (about 40%) were considered not meaningful by the annotators. As a result, there are only 5 (7%) dialogs for which there is not at least one issue. Overall, these findings raise the question whether such high failure rates would be acceptable to users in practice.
A major fraction of the generated sentences (33% for DeepCRS, 24% for KBRD) were considered chit-chat. The analyzed systems performed better when generating such chit-chat messages than when generating other types of utterances. The percentage of meaningful chit-chat responses is 85% and 87% for DeepCRS and KBRD, respectively. However, a large fraction of these chit-chat exchanges consists of trivial responses to 'hello', 'hi', 'goodbye', and 'thank you' utterances by the recommendation seeker. Overall, the chit-chat messages account for 39% of all generated sentences that were marked as meaningful.
The analysis of the quality of the recommendations themselves is what we thought would be a more subjective part of our evaluation. The agreement between the annotators was, however, very high (93% for DeepCRS and 92% for KBRD). The annotators relied both on their own expertise in the movie domain and on external sources like movie databases such as IMDb to check the plausibility of the recommendations. Recommendations were typically considered not meaningful when the annotators could not establish any plausible link between the seeker's preferences and the recommendation. An example is the system's recommendation of the movie "The Secret Life of Pets" after the seeker mentioned that s/he liked "Avengers: Infinity War". Quite interestingly, the subjective performance of KBRD is lower than that of DeepCRS, even though KBRD includes a knowledge graph, DBpedia (https://wiki.dbpedia.org/), that contains information about movies and their relationships.
Overall, the results in Table 3 show that the perceived recommendation quality is modest. For both the DeepCRS and the KBRD system, less than two thirds of the movie recommendations were considered meaningful. Moreover, DeepCRS did not produce even a single recommendation in 7 (10%) of the dialogs, and this was the case for 6 (8.5%) of the dialogs with KBRD.

Limitations of the ReDial Dataset. The existence of large-scale datasets containing human conversations is a key prerequisite for building a CRS based on end-to-end learning. The ReDial dataset is an important step in that direction. However, the dataset also has a number of limitations. These mostly have to do with the way it was created with the help of crowdworkers, who were given specific instructions about the minimum number of interactions and the minimum number of movie mentions.
As a result, many dialogs are not much longer than the minimum length, and the dialogs do not enter deeper discussions. The expression of preferences is very often based on movie mentions and only to a lesser extent based on preferences regarding certain features like genre or directors. The responses by the human recommenders are also mainly movie mentions. Explanations why a recommendation is a good match for the seeker's preferences are not very common. Developing an end-to-end system that is capable of providing explanations, which might be a helpful feature in any CRS, therefore might remain challenging.
The DeepCRS and KBRD systems did not provide explanations in the dialogs that we examined, except in cases where the system generated some sort of confirmatory utterance ("it is a very good movie"). Such utterances appeared in the training data and were correspondingly sometimes selected by the systems. The number of user intents that are actually supported by DeepCRS and KBRD is generally quite low; see [1, 9] for a list of possible intents in CRS. For the DeepCRS system, for example, the authors explicitly state that with their recommendation mechanism they are unable to respond to a seeker who asks for "a good sci-fi movie" [12]. Not being able to support intents related to explanations or feature-based requests again raises questions about the practicability of the investigated approaches.
Finally, when examining the dialogs that we sampled, we observed that a few conversations were broken. This was, for example, the case when a crowdworker had not understood the instructions. In one case, for instance, the seeker was not interested in a recommendation, but rather told the recommender that he would like to watch a certain movie. In our sample of 70 dialogs, we found 9 cases that we considered to be broken. To what extent such noise in the data impacts the performance of end-to-end learning systems, however, requires more investigation in the future.

5 CONCLUSIONS AND IMPLICATIONS
We performed an alternative and independent evaluation of two recently published end-to-end learning approaches to building conversational recommender systems. A manual inspection of the responses of the two systems reveals that these systems in many cases fail to react in a meaningful way to user utterances. The quality of the recommendations in these dialogs also appears to be limited.
Our findings have important implications. First, current evaluation practices, at least those from the analyzed papers, seem not to be informative enough to judge the practical usefulness of such systems. Relying on relative subjective comparisons (as in [12]) cannot inform us about whether or not the better-ranked system is actually good. Absolute evaluations (as in [4]) indicated mediocre outcomes, but the aggregation of the human ratings into one single metric value prevents us from understanding how well the system works for specific parts of the conversation (e.g., chit-chat, recommendation).
Measuring linguistic aspects like perplexity, or the BLEU score as done in some other works [10, 13], might in principle be helpful, even though there are some concerns regarding the correspondence of BLEU scores with human perceptions [14]. However, our analysis revealed that the examined systems almost exclusively generate sentences that were already present in the training data in identical or almost identical form. These objective measures, when applied on the sentence level, would therefore mainly judge, e.g., the perplexity of sentences written by human recommenders. A retrieval-based approach might therefore achieve the same performance or even better. This is an interesting future direction that we intend to explore.
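As an illustration of the direction we have in mind, the following is a minimal sketch of such a retrieval baseline, not an implementation used in this work: it simply returns the training-data recommender utterance whose preceding dialog context is most similar, under TF-IDF cosine similarity, to the current dialog history.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class RetrievalBaseline:
    """Nearest-neighbor retrieval over (dialog context, recommender response) pairs."""

    def __init__(self, contexts, responses):
        # contexts[i] is the concatenated dialog history that preceded
        # responses[i] in the training data.
        self.responses = responses
        self.vectorizer = TfidfVectorizer()
        self.context_matrix = self.vectorizer.fit_transform(contexts)

    def respond(self, history):
        # Score the current history against all stored contexts and return
        # the human response attached to the most similar one.
        query = self.vectorizer.transform([" ".join(history)])
        scores = cosine_similarity(query, self.context_matrix)[0]
        return self.responses[scores.argmax()]

# Usage sketch with toy data; a real experiment would extract the pairs from ReDial.
baseline = RetrievalBaseline(
    contexts=["hi i like scary movies", "hello i want something funny for kids"],
    responses=["have you seen the conjuring ?", "you could try despicable me ."],
)
print(baseline.respond(["hi there", "i am looking for a scary movie"]))
```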
As a result, our work calls for extended, alternative, and more realistic evaluation practices for CRS. In particular, in practical applications, certain guarantees regarding the quality of the system responses and recommendations might be required, which might be difficult to achieve with current end-to-end learning approaches.

REFERENCES
[1] Wanling Cai and Li Chen. 2020. Predicting User Intents and Satisfaction with Dialogue-Based Conversational Recommendations. In UMAP '20. 33–42.
[2] Patrick Y.K. Chau, Shuk Ying Ho, Kevin K.W. Ho, and Yihong Yao. 2013. Examining the effects of malfunctioning personalized services on online users' distrust and behaviors. Decision Support Systems 56 (2013), 180–191.
[3] Li Chen and Pearl Pu. 2012. Critiquing-based recommenders: survey and emerging trends. User Modeling and User-Adapted Interaction 22, 1-2 (2012), 125–150.
[4] Qibin Chen, Junyang Lin, Yichang Zhang, Ming Ding, Yukuo Cen, Hongxia Yang, and Jie Tang. 2019. Towards Knowledge-Based Recommender Dialog System. In EMNLP-IJCNLP '19. 1803–1813.
[5] Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. 2016. Towards Conversational Recommender Systems. In KDD '16. 815–824.
[6] Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 1 (1960), 37–46.
[7] Kristian J. Hammond, Robin Burke, and Kathryn Schmitt. 1994. Case-Based Approach to Knowledge Navigation. In AAAI '94.
[8] Dietmar Jannach. 2004. ADVISOR SUITE – A Knowledge-based Sales Advisory System. In ECAI '04. 720–724.
[9] Dietmar Jannach, Ahtsham Manzoor, Wanling Cai, and Li Chen. 2020. A Survey on Conversational Recommender Systems. ArXiv abs/2004.00646 (2020).
[10] Dongyeop Kang, Anusha Balakrishnan, Pararth Shah, Paul Crook, Y-Lan Boureau, and Jason Weston. 2019. Recommendation as a Communication Game: Self-Supervised Bot-Play for Goal-oriented Dialogue. In EMNLP-IJCNLP '19. 1951–1961.
[11] Dongyeop Kang, Anusha Balakrishnan, Pararth Shah, Paul A. Crook, Y-Lan Boureau, and Jason Weston. 2019. Recommendation as a Communication Game: Self-Supervised Bot-Play for Goal-oriented Dialogue. ArXiv abs/1909.03922 (2019).
[12] Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin, and Chris Pal. 2018. Towards Deep Conversational Recommendations. In NIPS '18. 9725–9735.
[13] Lizi Liao, Ryuichi Takanobu, Yunshan Ma, Xun Yang, Minlie Huang, and Tat-Seng Chua. 2019. Deep Conversational Recommender in Travel. ArXiv abs/1907.00710 (2019).
[14] Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. In EMNLP '16. 2122–2132.
[15] Tariq Mahmood and Francesco Ricci. 2009. Improving Recommender Systems with Adaptive Conversational Strategies. In HT '09. 73–82.
[16] Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building End-to-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In AAAI '16. 3776–3783.