=Paper=
{{Paper
|id=Vol-2202/paper4
|storemode=property
|title=Dialogue Modeling Via Hash Functions
|pdfUrl=https://ceur-ws.org/Vol-2202/paper4.pdf
|volume=Vol-2202
|authors=Sahil Garg,Guillermo Cecchi,Irina Rish,Shuyang Gao,Greg Ver Steeg,Sarik Ghazarian,Palash Goyal,Aram Galstyan
|dblpUrl=https://dblp.org/rec/conf/ijcai/GargCRGSGGG18
}}
==Dialogue Modeling Via Hash Functions==
Dialogue Modeling Via Hash Functions
Sahil Garg¹, Guillermo Cecchi², Irina Rish², Shuyang Gao¹, Greg Ver Steeg¹, Sarik Ghazarian¹, Palash Goyal¹, Aram Galstyan¹
¹ USC Information Sciences Institute, Marina del Rey, CA, USA
² IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA
sahil.garg.cs@gmail.com, {gcecchi,rish}@us.ibm.com, sgao@isi.edu, {gregv,ghazarian,goyal,galstyan}@isi.edu
Abstract

We propose a novel machine-learning framework for dialogue modeling which uses representations based on hash functions. More specifically, each person's response is represented by a binary hashcode where each bit reflects presence or absence of a certain text pattern in the response. Hashcodes serve as compressed text representations, allowing for efficient similarity search. Moreover, the hashcode of one person's response can be used as a feature vector for predicting the hashcode representing another person's response. The proposed hashing model of dialogue is obtained by maximizing a novel lower bound on the mutual information between the hashcodes of consecutive responses. We apply our approach in the psychotherapy domain, evaluating its effectiveness on a real-life dataset consisting of therapy sessions with patients suffering from depression; in addition, we also model transcripts of interview sessions between Larry King (television host) and his guests.

Patient: Its all my fault. It's like getting hit in the stomach And trying to catch your breath. You can't catch it. You struggle, move around, You just can't get your breath. That just really knocks me off balance. You begin to generate this feeling of a kind of negativism. And that's why it really being hurt. I never allowed myself to think negative, accept everybody else. I am the only one giving back to the "be quiet time." I had to have something to make do with that.
0101001111001011111110101101011110011011100011101
Therapist: OK. Maybe that fits into the whole pattern then.
0111001101001011111100011111111110001010110011001

Table 1: In this table, we show pairs of patient and therapist (blue) responses. For each response, we obtain a locality sensitive hashcode as its representation. The hashcode of a patient response is used as a feature representation to infer the hashcode of the corresponding therapist response.

1 Introduction

Dialogue modeling and generation is an area of active research, and of great practical importance, as it provides a basis for building successful conversational agents in a wide range of applications. While open-domain dialogue remains a challenging open problem, developing dialogue systems for particular applications can be more tractable, due to specific properties of the application domain.

A motivating application for our work is (semi-)automated psychotherapy: easily accessible, round-the-clock psychotherapeutic services provided by a conversational agent. According to recent estimates, mental health disorders affect one in four adult Americans, one in six adolescents, and one in eight children. Furthermore, as predicted by the World Health Organization, by 2030 the amount of worldwide disability and life loss attributable to depression may become greater than for any other condition, including cancer, stroke, heart disease, accidents, and war. However, many people do not receive adequate treatment; one of the major factors here is the limited availability of mental health professionals, as compared to the number of potential patients. Thus, creating automated systems providing mental health services is becoming an increasingly important area of research and development; however, existing conversational systems in this domain are still primarily rule-based, and lack sophisticated dialogue modeling and generation capabilities based on actual text understanding.

An interesting feature of therapeutic dialogues appears to be a classical pattern of a relatively long monologue of a patient followed by a considerably shorter response of a therapist; a typical example is shown in Tab. 1 (taken from a dataset we experimented with, containing multiple transcribed depression therapy sessions). Moreover, therapist's responses tend to be rather high-level, generic statements, typically summarizing/confirming the patient's response, and can be viewed as a "category label" being "assigned" to a patient's text "sample". Generating such a short, more constrained response can be a somewhat easier task than solving the open-domain dialogue challenge. Note that, rather than generating a specific response sentence directly, a more effective approach can be to first infer a conceptual, high-level representation reflecting the essence of the "most relevant" response to the patient.

Motivated by the above considerations, we introduce here a novel dialogue modeling framework where responses are represented as locality-sensitive binary hashcodes [Kulis and Grauman, 2009; Joly and Buisson, 2011; Garg et al., 2018b]. The motivation behind such an approach includes several considerations: (a) it can be easier to infer a generalized, interpretable representation (e.g., a binary hashcode) of a response (along with the inverse mapping) instead of a specific textual response (due to a potentially smaller search space); (b) hashcodes of patient responses can be used as feature vectors to infer the hashcodes of therapist responses (multiple binary class labels), as previously demonstrated for an information-extraction task [Garg et al., 2018a]; (c) an inferred therapist's hashcode can be efficiently mapped to multiple text sentences (possible choices for the therapist response). See Tab. 1.

While the above approach was motivated by the psychotherapy domain, it is generally applicable to a wide range of other domains involving dialogue modeling and generation. The key contributions of this work include:
• a novel, generic framework for modeling a dialogue using locality sensitive hash functions (based on kernel similarities or neural networks, as discussed later) to represent textual responses;
• a novel lower bound on the Mutual Information (MI) between the hashcodes of the responses from the two agents (capturing relevance between the two), used as an optimization criterion to optimize a hashing model;
• an approach using the MI lower bound as a metric for joint evaluation of the representation quality of hashcodes, and the quality of inference. Also, a tight upper bound on joint entropy is derived, to separately characterize the quality of hashcodes as representations.

Empirical evaluation of the proposed framework and the optimization approach, which uses both kernels and neural networks for locality sensitive hash functions, is performed on a dataset containing textual transcriptions of 400 depression-therapy sessions conducted by human therapists with real patients. We find that our approach of optimizing hashing by maximizing our novel MI lower bound provides a consistent improvement over alternative types of hashing approaches (using both kernels and neural nets), in terms of the proposed evaluation metrics. Further, the general applicability of our work is demonstrated by experiments on another dataset of transcriptions of 75 interview sessions between television host Larry King and his guests.

Figure 1: An illustration of the locality sensitive hashing approach. In the figure, each dialogue response is represented as a dot. In the top left of the figure, for building a hash function, 4 responses are selected randomly for assignment of artificial red color labels, representing hash bit value 1, and another 4 responses are selected randomly for assignment of blue labels. Having these 8 responses with artificially assigned labels, we can learn a maximum margin boundary as hash function h1(.), i.e. training a kernel-SVM model discriminating between the red and blue labeled dialogue responses; instead of a maximum margin boundary, one can learn a kernel or neural language model based (regularized) boundary as well. Following the same procedure, we assign artificial labels to other small subsets of responses, so as to learn hash functions h2(.), h3(.), and so on. In the bottom right of the figure, we see hashcodes assigned to responses as per the three hash functions; the dark gray dots represent those responses which we used for building the hash functions, referred to as a reference set; for computing the hashcode of a given test response (light gray dot), we need to compute its kernel similarities w.r.t. the dark gray dots only. In practice, we first select the reference set (dark gray dots), as a random or optimized subset of all the responses (all the dots), and then take smaller random subsets (red or blue labels) out of the reference set for building hash functions.

2 Related Work

Dialogue agents, e.g. chat bots, are becoming increasingly popular in various applications, including (semi-)automated psychotherapy, for example, based on popular techniques such as CBT (Cognitive Behavioral Therapy [Lewinsohn et al., 1990]); however, these agents have very limited abilities to actually understand free text responses from their users; instead, they typically offer a fixed set of pre-programmed responses to choose from [Fitzpatrick et al., 2017]. See [Jurafsky and Martin, 2014] for an overview.

Several neural network based approaches have been proposed in the recent past for dialogue modeling in the general domain [Serban et al., 2015; 2016; 2017a; 2017b; Shao et al., 2017; Asghar et al., 2017; Wu et al., 2017]. However, that setting is somewhat different from therapy dialogues, where the patient responses can be extremely long (up to tens of thousands of words). Also, evaluating the effectiveness of the therapist's response requires some notion of relevance which goes beyond the standard measures of its semantic features [Papineni et al., 2002; Liu et al., 2016; Li and Jurafsky, 2016; Lowe et al., 2017; Li et al., 2017]; we consider here an information-theoretic approach to capture this notion. Also, therapy dialogue modeling and generation has some similarities with task-driven dialogue [Zhai and Williams, 2014; Wen et al., 2016; Althoff et al., 2016; Lewis et al., 2017; He et al., 2017], although evaluating the effectiveness of a therapeutic dialogue may be more challenging, as the effect is not always immediate. Attention to specific parts of the response, as well as background knowledge, explored in neural network based dialogue modeling [Kosovan et al., ], can be quite helpful in therapeutic dialogues; those aspects are, to some extent, implicitly captured by learning the hash models. Note that in the related work by [Bartl and Spanakis, 2017], hashcodes are not treated as representations of the responses, and are used only for the nearest neighbor search, unlike the approach proposed here.
Algorithm 1 Generalized Algorithm for Constructing Locality Sensitive Hash Functions, with Kernel Functions or Neural Networks
Require: Reference set of structures S^R, of size M; sub-sampling parameter α for randomizing hash functions; the number of hash functions, H.
1: z ← {0_α, 1_α}  % artificial labels for the hash code bit; random subsets are fixed across structure inputs
2: for j = 1 → H do
3:   r_hj ← randomSubset(M, 2α)  % random subsampling of structures for building a randomized locality sensitive hash function
4:   h_j ← TrainBinaryClassifierAsHashFunction(S^R(r_hj), z)  % train a binary classifier on the random subset of structures S^R(r_hj) with their randomly assigned hash labels z, either using a convolution kernel similarity (subsequence/path/graph kernels) based classification model such as an SVM or kNN, or a neural language model such as an LSTM or GRU, serving as a binary hash function
5: end for
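To make the construction in Alg. 1 concrete, the following is a minimal Python sketch of the kernel variant, assuming a precomputed convolution-kernel matrix over the reference set; the classifier choice (a precomputed-kernel SVM, standing in for the SVM/kNN/neural options named in the algorithm) and the helper names are illustrative assumptions, not the authors' code.

```python
import numpy as np
from sklearn.svm import SVC

def build_hash_functions(K_ref, H, alpha, C=0.1, seed=0):
    """Alg. 1 sketch: learn H randomized binary hash functions.

    K_ref : (M, M) precomputed kernel matrix over the reference set S^R.
    Each hash function is a binary classifier trained on 2*alpha reference
    structures carrying artificial labels (alpha zeros, alpha ones).
    """
    rng = np.random.RandomState(seed)
    M = K_ref.shape[0]
    z = np.array([0] * alpha + [1] * alpha)               # artificial hash-bit labels
    hash_functions = []
    for _ in range(H):
        r_hj = rng.choice(M, size=2 * alpha, replace=False)  # random subset of S^R
        clf = SVC(kernel="precomputed", C=C)                 # regularized boundary
        clf.fit(K_ref[np.ix_(r_hj, r_hj)], z)                # discriminate the two subsets
        hash_functions.append((r_hj, clf))
    return hash_functions

def hashcode(k_row, hash_functions):
    """Hashcode of one response given its kernel similarities to S^R (length M)."""
    return np.array([clf.predict(k_row[r_hj][None, :])[0]
                     for r_hj, clf in hash_functions], dtype=int)
```

For the neural variant of the algorithm, the SVM would be replaced by a small LSTM/GRU binary classifier trained on the same randomly labeled subset, with the raw responses as inputs instead of kernel rows.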
The idea of maximizing a mutual information objective for dialogue modeling has been considered previously by [Li et al., 2015], though it was only used at test time, rather than as a guide for training.

An evaluation metric such as the BLEU score [Papineni et al., 2002] may not be the best for our application, since it tries to capture all information in text when comparing an inference w.r.t. the ground truth, rather than evaluate the relevance of one response to the other. We propose several evaluation metrics based on binary hashcodes themselves, since the latter are assumed to serve as efficient representations of text in our dialogue context.

3 Dialogue Modeling via Binary Hashing

Here we discuss our novel framework for modeling a dialogue between two agents using binary hash functions. In our formulation below, we refer to the two agents as a patient and a therapist, respectively, although the approach can clearly be applied to a variety of other dialogue settings.

3.1 Problem Formulation

In a dialogue session between a patient and a therapist, we denote the ith response from the patient as S_i^p, and the corresponding response from the therapist as S_i^t; in a session, we have S^{pt} = {S_i^p, S_i^t}_{i=1}^N, with additional notations for sets: S̄ = {S_1^p, ..., S_N^p, S_1^t, ..., S_N^t}, S^p = {S_1^p, ..., S_N^p}, S^t = {S_1^t, ..., S_N^t}. In this work, we consider a response S_i, be it from the patient or the therapist in a session, as a natural language structure which can be simply plain text, or text along with part of speech (PoS) tags, or a syntactic/semantic parse of the text. As per a typical dialogue modeling setup, the task would be to predict S_i^t given S_i^p, i.e. generating therapist responses. However, we propose a different setup. We propose to encode each response of a patient/therapist (S_i^p / S_i^t) in a therapy session as a binary hashcode (c_i^p / c_i^t ∈ {0, 1}^H), and focus upon the problem of inferring the therapist hashcode (c_i^t) given the hashcode (c_i^p) of the patient response (S_i^p), before finally mapping the inferred therapist hashcode (c_i^{t*}) to multiple text response choices. Note that c_i^t is the hashcode representation of the ground truth therapist response S_i^t (so it is the ground truth itself), whereas c_i^{t*} is the inference given only the knowledge of S_i^p, its hashcode c_i^p, and no knowledge of S_i^t; all the hashcodes are generated using the same hashing model M_h. See Tab. 1 for examples of pairs of patient/therapist responses and their corresponding hashcode representations.

Note that, herein, we keep the problem setting simple, considering only the previous response of the patient to make an inference of the therapist response, although the proposed approach of modeling responses via hash functions is generic enough for more sophisticated dialogue modeling setups, such as the ones using reinforcement learning.

A hashcode for a dialogue response can be thought of as a generalized representation of the response, such that the binary bit values in the hashcode denote the existence of important patterns in the response which are relevant for the therapeutic dialogue between the patient and the therapist, whereas the rest of the response is effectively noise that should be irrelevant from a psychotherapy perspective. The additional advantage of modeling responses as hashcodes is that we can map a binary hashcode to multiple text responses efficiently, i.e. generating therapist responses as per the inferred therapist hashcode.

Next, we describe our approach for generating locality sensitive hashcodes of dialogue responses, using either kernel similarity functions, or neural models such as LSTMs.

3.2 Locality Sensitive Hashcodes of Responses using Kernel Functions or Neural Language Models

The main idea behind locality sensitive hashing is that data points that are similar to each other should be assigned hashcodes with minimal Hamming distance to each other, and vice versa. A similarity/distance function need not be defined explicitly; see [Wang et al., 2014] for details.

Since locality sensitive hashing ensures that natural language structures that are assigned hashcodes with low Hamming distance to each other are similar to each other, locality sensitive hashcodes should serve as generalized representations of language structures (with a similarity/distance function implied as per the locality sensitive hashing model learned for a given task)¹, and so for the responses in a dialogue.

Many hash functions have been proven to be locality sensitive, with rigorous mathematical proofs [Wang et al., 2017]. However, the literature on data driven locality sensitive hash functions is recent, such as those based on kernel similarity functions [Kulis and Grauman, 2009; Joly and Buisson, 2011; Garg et al., 2018b]; see Fig. 1 for a demonstration of locality sensitive hashing of dialogue responses.

¹ A similarity function doesn't imply a semantic similarity function here. For instance, as per a learned locality sensitive hashing model, the implied (valid) similarity function may account for matching of only certain patterns in textual responses.
Figure 2: In the left side of the figure, we have patient textual responses from the depression dataset, and the corresponding therapist responses (in blue colored text) are shown on the right. We optimize a hashing model for dialogue modeling so that the Mutual Information (MI) between hashcodes of consecutive (textual) responses is maximized in a training dataset; our intuition is that, with maximization of MI, hashcodes should encode textual information in responses that is mutually relevant for the conversation, while ignoring the rest as noise. For instance, we would like a good hashing model to encode the bold text patterns in the patient responses, since those are highly relevant for the therapist responses. Note that we maximize MI only for training of the hashing model, not for the inference of therapist hashcodes.

Although these works lack theoretical guarantees for locality sensitivity of hash functions, the common intuitive principle in these approaches is to learn a randomized binary hash function, constructed using a very small subset of the available data points. For instance, in [Joly and Buisson, 2011], a kernelized hash function is built by learning a (random) maximum margin boundary (SVM) that discriminates between the samples of two small random subsets of a super set; the same principle is followed in [Garg et al., 2018b] for constructing a random hash function, i.e. obtaining a k-Nearest Neighbors based discrimination boundary between the two random subsets using kernel functions, instead of a maximum margin boundary. In [Kulis and Grauman, 2009; 2012], a hash function is built using the union of two random subsets, which gives an (approximately) random linear hyperplane in the kernel-implied feature space. Despite the lack of theoretical guarantees for locality sensitivity of kernel based hash functions, the principle of constructing each random hash function by learning it on a very small random subset of samples works well in practice.

Considering the commonality in these hashing methods due to the principle of random sub-sampling, we present a generalization in Alg. 1 for generating locality sensitive hashcodes, that is applicable for building a random hash function operating on language structures, either (i) using a convolution kernel, K(S_i, S_j), defining similarity between two structures S_i, S_j [Mooney and Bunescu, 2005; Collins and Duffy, 2002; Haussler, 1999], as proposed in the previous works [Kulis and Grauman, 2009; Joly and Buisson, 2011; Garg et al., 2018b], or, as we additionally propose in this paper, (ii) using a neural language model such as an LSTM, GRU, etc.

In Alg. 1, the pseudo code is generically applicable for obtaining the hashcode of any language structure, including responses from a dialogue. We have a set of language structures, S^R of size M, with no class labels. For building the jth hash function, we take a random subset S^R(r_hj) of size 2α, s.t. 1 < α ≪ M, from the set S^R; this subset is partitioned into two subsets, each of size α, through the assignment of artificially generated hash bit labels z. Having the two small subsets, we can learn a binary classifier (preferably regularized), based on kernel similarities or a neural language model, that discriminates between the two subsets, acting as a hash function. Following this principle, we can learn H hash functions, with each one taking a constant number of computations for learning, since the size of the subset S^R(r_hj) is a small constant, 2α ≪ M.

When building the hash functions using kernel similarity functions, the size M of the reference set S^R should be small. This is because generation of a kernelized hashcode for an input structure S_i using the locality sensitive hashing algorithm requires computing convolution kernel similarities of S_i with all the elements of the reference set S^R. A hashcode c_i for a structure S_i represents finding important substructures in it related to the set S^R. This means that, when generating hashcodes of dialogue responses as representations in our framework, we can control what patterns should be attended to by the therapist in the patient responses through the optimization of the reference set itself. This idea can be very powerful when modeling dialogues involving long responses from patients. In Sec. 4, we propose to optimize a reference set using a novel algorithm.²

² Note that the similarity function implied from a kernel locality sensitive hashing model may not be the same as the kernel function. While the kernel function may be accounting for similarity of all possible patterns in textual responses, the similarity function would be effectively matching only certain patterns, defined as per the reference set in the hashing model.
On the other hand, when constructing a parametric hash function by learning a binary classifier based on a neural language model as proposed above, we don't need to restrict the size of the reference set, S^R. Instead, we propose to optimize the architecture of all the H neural models, corresponding to the H hash functions, jointly with our novel algorithm in Sec. 4, while learning the weight parameters of the neural models independently of each other.

Algorithm 2 Algorithm for Optimizing Neural Architecture in a Neural-Hashing Model for Dialogue Modeling by Maximizing Our MI LB
Require: Train sets, S^p, S^t; maximum number of layers in neural language models, L; the number of samples for computing the MI lower bound, γ; values for units in a layer, u = {5, 10, 20, 40, None}.
% optimizing up to L layers greedily
1: for j = 1 → L do
2:   r_lb ← randomSubset(N, γ)  % subset of patient/therapist response pairs for computing the MI LB to optimize the jth layer
     % number of units in the jth layer; None not applicable for the 1st layer
3:   for l ∈ u do
4:     n(j) ← l  % l units in the jth layer of the neural network
5:     C^p ← computeHashcodes(S^p(r_lb), n)
6:     C^t ← computeHashcodes(S^t(r_lb), n)
7:     milb(l) ← computeMILowerBound(C^p, C^t)
8:   end for
9:   n(j) ← maxMILBIndex(milb, u)  % choose the number of units with the maximum value of MI LB
10:  if n(j) is None then break out of loop end if
11: end for
12: return n

3.3 Mapping Inferred Hashcodes to Textual Responses

While we propose to represent patient/therapist responses as hashcodes, and to infer the hashcode for a therapist response instead of directly inferring the textual response, we can map an inferred therapist hashcode to multiple choices for the final textual response. As mentioned previously, one of the main advantages of using binary hash functions to build representations of responses is that one can use the same binary hashing model for indexing a repository of sentences. For the task of building an automated therapist (assisting a human therapist), from a training set of therapy sessions, we can put all the responses from therapists into a single repository. Then, having a hashing model as per Alg. 1, all the responses can be indexed with binary hashcodes. Now, after inferring a therapist hashcode, we can search in the repository of hashcode-indexed responses to find multiple responses having minimal Hamming distance w.r.t. the inferred hashcode. Even if the number of hashcode bits were large, sublinear time algorithms can be used for efficient similarity search (nearest neighbors) in Hamming spaces [Norouzi et al., 2014; Komorowski and Trzcinski, 2017]. For examples, see Tab. 4.
Komorowski and Trzcinski, 2017]. For e.g., see Tab. 4. to improve the accuracy for the inference of the therapist re-
sponse hashcodes, maximizing the entropy term, H(C t ; Mh ),
should ensure good quality of the hashcodes as generalized
4 Learning Hashing for Dialogue Modeling representations. See Fig. 2 for an illustration.
with Mutual Information Maximization Computing mutual information between two high-
In the previous section, we introduced a framework for mod- dimensional variables can be expensive, potentially inaccu-
eling dialogues via hash functions, especially in the context of rate if the number of samples is small [Kraskov et al., 2004].
a therapeutic dialogue between a therapist and a patient. The So, we propose to optimize the hashing model, by maximiz-
question is what objective function to use for the optimization ing a lower bound on the mutual information criterion.
of the hashing model? In the following, we introduce an ob- 4.2 Information Theoretic Bounds for Optimizing
jective function that is suitable for optimizing hash functions and Evaluating Hashing for Dialogue
for dialogue modeling. The two main criteria for selecting Modeling
this objective function are as follows: (1) it should charac-
terize the quality of hashcodes as generalized representations We develop a novel lower bound on the mutual informa-
of dialogue responses; (2) it should also account for the infer- tion criterion, that is cheaper and more accurately computable
ence of therapist hashcodes, i.e. therapist dialogue generation. than the criterion itself.
Although the primary motivation of the present work is thera- Theorem 1 (Lower Bound for Mutual Information). Mu-
peutic dialogue modeling, we note that the above criteria also tual information between two hashcode distributions, I(Ct :
apply to more general dialogue modeling problems as well. Cp ; Mh ), is lower bounded as,
X
I(Ct : Cp ; Mh ) ≥ H(Ct (j); Mh ) − T C(Ct : Y ∗ ; Mh )
4.1 Objective Function Formulation to Optimize j
Hashing for Dialogue Modeling X
+ hlog q(Ct (j)|Cp ; Mh )ip(Ct (j),Cp ;Mh ) . (3)
We propose that maximizing Mutual Information between the j
hashcodes of patient and the corresponding therapist dialogue
responses is suitable to optimize on both the inference accu- Y ∗ ← arg max T C(Ct : Y; Mh ) (4)
racy as well as the representation quality. Y:|Y|=|Cp |
Herein, T C(Ct : Y; Mh ) describes the amount of Total Cor-
arg max I(Cp : Ct ; Mh ) (1) relations (Multi-variate Mutual Information) 3 within Ct that
Mh
3
I(Cp : Ct ; Mh ) = H(Ct ; Mh ) − H(Ct |Cp ; Mh ) (2) The “total correlation” quantity, also called as multi-variate Mu-
| {z } | {z } tual Information, or multi-information, was defined in [Watanabe,
representation inference 1960; Studenỳ and Vejnarová, 1998; Kraskov et al., 2005].
28
Algorithm 3 Algorithm for Optimizing the Reference Set in Kernelized Hashing for Dialogue Modeling by Maximizing Our MI LB
Require: Training sets, S^p, S^t, S̄; initial and final size of the reference set, I and M respectively; β and γ are the number of samples, as candidates for the reference set, and for computing the lower bound, respectively.
1: r_d ← randomSubset(2N, I)  % random subset of indices, of size I, from {1, ..., 2N} for initialization of the reference set
% optimizing the reference set up to size M greedily
2: for j = 1 → M do
3:   if j > I then
4:     r_d ← {r_d, randomSubset(2N, 1)}  % adding one more element to the reference set, that is to be optimized
5:   end if
6:   r_ref ← randomSubset(2N, β)  % subset of the structures as candidates for the reference set
7:   r_lb ← randomSubset(N, γ)  % subset of patient/therapist response pairs for computing the MI lower bound
8:   K^p ← computeKernel(S^p(r_lb), S̄(r_d))  % size γ×j
9:   K^t ← computeKernel(S^t(r_lb), S̄(r_d))  % size γ×j
10:  K̄^p ← computeKernel(S^p(r_lb), S̄(r_ref))  % size γ×β
11:  K̄^t ← computeKernel(S^t(r_lb), S̄(r_ref))  % size γ×β
12:  milb ← computeMILowerBound(K^p, K^t, K̄^p, K̄^t)  % compute the MI lower bound for all the candidates S̄(r_ref), using the kernel matrices, via computation of hashcodes
13:  r_d(j) ← maxMILBDataIndex(milb, r_ref)  % choose the index with the maximum value of the MI lower bound, from the set of candidate indices r_ref
14: end for
15: return S̄(r_d)
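The following is a compact Python sketch of the greedy loop in Alg. 3, under my reading that each greedy step re-optimizes one slot of the reference set over β sampled candidates. The `mi_lower_bound` callable is a placeholder for an estimator of the bound in Theorem 1 computed from hashcodes of the γ sampled response pairs; all helper names are assumptions, not the authors' implementation.

```python
import numpy as np

def optimize_reference_set(S_all, S_p, S_t, M, I, beta, gamma,
                           mi_lower_bound, seed=0):
    """Alg. 3 sketch: greedily grow and optimize the reference set S^R.

    S_all : list of all 2N candidate structures (patient + therapist responses)
    S_p, S_t : aligned lists of N patient / therapist responses
    mi_lower_bound(ref_indices, pair_indices) -> float  (MI LB estimator)
    """
    rng = np.random.RandomState(seed)
    N2, N = len(S_all), len(S_p)
    r_d = [int(i) for i in rng.choice(N2, size=I, replace=False)]  # initialization
    for j in range(M):
        if j >= I:
            r_d.append(int(rng.choice(N2)))                # new slot to be optimized
        r_ref = rng.choice(N2, size=beta, replace=False)   # candidate structures
        r_lb = rng.choice(N, size=gamma, replace=False)    # pairs for the MI LB
        scores = []
        for cand in r_ref:
            r_d[j] = int(cand)                             # try this candidate in slot j
            scores.append(mi_lower_bound(r_d, r_lb))
        r_d[j] = int(r_ref[int(np.argmax(scores))])        # keep the best candidate
    return [S_all[i] for i in r_d]
```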
Herein, TC(C^t : \mathcal{Y}; M_h) describes the amount of Total Correlation (multi-variate Mutual Information)³ within C^t that can be explained by a latent variable representation \mathcal{Y}; q(C^t(j) | C^p; M_h) is a proposal conditional distribution for the jth bit of the therapist hashcodes, built using a classifier such as a random forest, a neural network, etc.

³ The "total correlation" quantity, also called multi-variate Mutual Information, or multi-information, was defined in [Watanabe, 1960; Studenỳ and Vejnarová, 1998; Kraskov et al., 2005].

An interesting aspect of the quantity TC(C^t : \mathcal{Y}; M_h) is that one can compute it efficiently for an optimized \mathcal{Y}^* that explains the maximum possible Total Correlation present in C^t; see [Ver Steeg and Galstyan, 2014] for more details on the optimization. For practical purposes, the dimension of the latent representation \mathcal{Y} can be kept much smaller than the dimension of the hashcodes, i.e. |\mathcal{Y}| ≪ |C^p| for |C^p| ≫ 1.

As we observed for the mutual information criterion above, we can also see different terms in the mutual information lower bound having varying roles in the optimization; for instance, the first two terms in (3) contribute to improving the quality of hashcodes as representations, i.e. maximizing the entropy of each hashcode bit while discouraging redundancies between the bits, and the last term of conditional entropies is for improving the inference of hashcode bits individually.

We argue for using the proposed MI lower bound not only for optimizing hashing for dialogue modeling, but also as an evaluation metric quantifying the representation quality as well as the inference of hashcodes; besides, it serves to quantify what is mutually relevant between the responses of two dialogue agents.

Further, for evaluating the representation quality of hashcodes separately, in Theorem 2 below, we present a novel tight upper bound for the joint entropy of hashcodes.

Theorem 2 (Upper Bound for Entropy). Joint entropy of hashcodes, H(C^t; M_h), is upper bounded as,

H(C^t; M_h) \le \sum_j H(C^t(j); M_h) - TC(C^t : \mathcal{Y}^*; M_h); \quad \mathcal{Y}^* \leftarrow \arg\max_{\mathcal{Y}: |\mathcal{Y}| = |C^p|} TC(C^t : \mathcal{Y}; M_h).

Note the similarities between the expressions for the proposed mutual information lower bound and the joint entropy upper bound; the first two terms match between the two expressions.

For TC(C^t | \mathcal{Y}^*; M_h) = 0, i.e. when a latent representation \mathcal{Y}^* is learned which explains all the total correlation in C^t, the upper bound becomes equal to the entropy term; practically, for the case of hashcodes, learning such a representation should not be too difficult, so the bound should be very tight. This is quite interesting, as we are able to judge the representation quality of hashcodes, which we proposed to quantify via the entropy of hashcodes above, through the tight upper bound in Theorem 2, which is much easier and cheaper/more accurate to compute than the entropy term itself. Besides the mutual information lower bound, for more detailed analysis, we propose to use the upper bound as an evaluation metric for the dialogue problem when modeling via hash functions.

The proposed information theoretic bounds are generically applicable to any high dimensional variables. For instance, our MI LB and Entropy UB are generically applicable to other paragraph-embedding like representations [Wieting et al., 2015; Arora et al., 2016], though they are more efficient to compute on binary ones. See the Appendix for derivations of the bounds.

Previously, variational lower bounds on the mutual information criterion have been proposed [Barber and Agakov, 2003; Chalk et al., 2016; Gao et al., 2016; Chen et al., 2016; Alemi et al., 2017; Garg et al., 2018a] for settings where one of the two variables is fixed (say, class labels). This simplifies the derivation significantly, compared to our setting.

Next, we discuss details of how to optimize kernel- or neural-hashing (Alg. 1) by maximizing our MI LB.
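To make the use of the bound as a training objective and evaluation metric concrete, the following is a rough Python sketch of how the terms of the MI LB in Theorem 1 can be estimated from paired hashcode samples: per-bit entropies from empirical bit frequencies, and the conditional term from a per-bit proposal classifier (a random forest here). The total-correlation term TC(C^t : Y*; M_h) is taken as an input, since in the paper it is obtained with the latent-factor method of [Ver Steeg and Galstyan, 2014]; this is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def bit_entropy(bits):
    """Empirical entropy (in nats) of one binary hashcode bit."""
    p = np.clip(bits.mean(), 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def mi_lower_bound(C_p, C_t, tc_term=0.0, n_trees=100):
    """Estimate the MI LB of Theorem 1 from paired hashcode samples.

    C_p, C_t : (n_pairs, H) binary hashcodes of patient / therapist responses
    tc_term  : estimate of TC(C^t : Y*; M_h), e.g. from a CorEx-style latent model
    """
    entropy_sum, loglik_sum = 0.0, 0.0
    for j in range(C_t.shape[1]):
        y = C_t[:, j]
        entropy_sum += bit_entropy(y)
        if y.min() == y.max():
            continue                      # degenerate bit: skip the proposal term
        q = cross_val_predict(RandomForestClassifier(n_estimators=n_trees),
                              C_p, y, cv=3, method="predict_proba")
        q_true = np.clip(q[np.arange(len(y)), y], 1e-12, None)
        loglik_sum += np.log(q_true).mean()  # <log q(C^t(j) | C^p)>
    return entropy_sum - tc_term + loglik_sum
```

The Entropy UB of Theorem 2 corresponds to `entropy_sum - tc_term` in this sketch, so both metrics can be read off the same computation.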
Hashing Func. | Hash Config. | Model | MI LB (Shuffled) | Entropy UB | NMI LB (Shuffled) | Accuracy (Baseline)
Kernel-RkNN | M=30, α=10 | Random Sel. | −12.1 ± 4.4 (−12.1 ± 2.4) | 5.1 ± 4.6 | N.A. | 93.6 ± 8.6 (89.4 ± 12.2)
Kernel-RkNN | M=30, α=10 | MI LB Opt. | −4.2 ± 1.2 (−9.5 ± 0.2) | 30.2 ± 1.1 | N.A. | 75.3 ± 16.5 (68.7 ± 17.4)
Kernel-RkNN | M=100, α=10 | Random Sel. | 9.9 ± 0.4 (4.3 ± 0.2) | 19.9 ± 0.3 | 0.50 (0.22) | 88.3 ± 13.9 (84.7 ± 16.0)
Kernel-RkNN | M=100, α=10 | MI LB Opt. | 13.5 ± 0.4 (5.6 ± 0.3) | 26.1 ± 0.3 | 0.52 (0.21) | 82.7 ± 16.2 (78.4 ± 17.3)
Kernel-RkNN | M=300, α=10 | Random Sel. | 15.3 ± 0.2 (8.6 ± 0.1) | 22.7 ± 0.2 | 0.67 (0.38) | 87.6 ± 14.7 (83.9 ± 16.6)
Kernel-RkNN | M=300, α=10 | MI LB Opt. | 15.7 ± 0.3 (8.4 ± 0.2) | 24.0 ± 0.3 | 0.65 (0.35) | 85.8 ± 16.0 (82.0 ± 17.6)
Kernel-RMM | M=100, α=10 | Random Sel. | 10.0 ± 0.5 (3.5 ± 0.1) | 23.2 ± 0.3 | 0.43 (0.15) | 85.9 ± 11.5 (79.3 ± 13.3)
Kernel-RMM | M=100, α=10 | MI LB Opt. | 11.4 ± 0.5 (2.8 ± 0.2) | 26.0 ± 0.6 | 0.44 (0.11) | 82.0 ± 11.6 (74.5 ± 13.2)
Kernel-RMM | M=300, α=10 | Random Sel. | 10.3 ± 0.4 (3.6 ± 0.1) | 22.1 ± 0.3 | 0.47 (0.16) | 86.2 ± 11.9 (79.9 ± 14.0)
Neural-RLSTM | M=76000, α=50 | MI LB Opt. | 17.2 ± 0.3 (8.0 ± 0.2) | 23.2 ± 0.3 | 0.74 (0.34) | 84.6 ± 17.2 (79.9 ± 19.9)

Table 2: Evaluating the quality of inference of dialogue responses (hashcodes), for the depression psychotherapy dataset.
4.3 Details on Optimization of Kernel- or Neural-Hashing with Maximization of MI LB

Above, we formalized an objective function, i.e. a lower bound on the mutual information criterion, to optimize all the hash functions jointly for dialogue modeling, while keeping each hash function randomized so as to preserve the property of locality sensitive hashing. In this section, we discuss optimization details specific to the kernel- or neural-hashing models described in Sec. 3.

Optimizing Kernel-Hashing Models for Dialogue Modeling
For kernel-hashing in Alg. 1, we can optimize the reference set, S^R, or the kernel parameters. For the selection of structures in S^R, we use a greedy algorithm maximizing the proposed mutual information lower bound (in Theorem 1); see the pseudo code in Alg. 3. We initialize a reference set of small size I ≪ M, by randomly selecting responses from the training set of patient/therapist responses, i.e. S̄ = {S_1^p, ..., S_N^p, S_1^t, ..., S_N^t}; though, as noted before, the superset S̄ for the random selection can be any set of sentences/paragraphs, not necessarily coming from a dataset of patient/therapist responses. First, each element in the initial reference set is optimized greedily, and then more elements are added one by one until the reference set size grows to M. When optimizing each element in the set, for computing the MI lower bound, we sample γ response pairs from the training set of patient/therapist response pairs, {(S_i^p, S_i^t)}_{i=1}^N. For computational efficiency, we adopt the idea of sampling for the candidate set as well: in each greedy optimization step, we sample a subset of candidates of size β from the set S̄.

The computation cost in the optimization is dominated by the number of convolution kernel similarities, i.e. O(γ(M² + Mβ)). In practice, we can keep low values of γ as well as β; in our experiments, we use β = 1000, γ = 100, and vary the value of M from 30 up to 300. A similar procedure can be used to optimize kernel parameters.

Optimizing Neural Network Architecture in Neural-Hashing Models
As mentioned previously, if using neural language models for hashing in Alg. 1, we can optimize the number of layers and the units in each layer by maximizing the proposed MI LB; see the pseudo code in Alg. 2.

5 Experiments

In this section, we discuss our experimental simulations, primarily on a dataset of textual transcriptions of the audio recordings of 400 depression therapy sessions, each conducted by a human therapist with a real patient, overall involving hundreds of patients and therapists, along with an additional dataset of interview transcripts between Larry King (host) and his guests.⁴ For the purposes of evaluations in this paper, we put all the pairs of patient/therapist responses from different sessions together in a single set, {S_i^p, S_i^t}_{i=1}^N, of size N=42,000. We perform a random 90%-10% split of the dataset into a training set (38,000 response pairs) and a test set (4,200 response pairs), respectively.⁵ Similarly, for the second dataset, we put together all pairs of host/guest responses from 75 sessions in a single set of size 8,200. Further, we perform 10 trials for obtaining the statistics on evaluation metrics in each experiment, with 95% of response pairs sampled from the training subset, and the same percentage of pairs from the test subset.

Hashing Settings
The number of hashcode bits is one hundred, H = 100, for both patient and therapist responses. To obtain hashcodes of responses, for each sentence in a textual response of a patient/therapist, we obtain Part of Speech (POS) tags as additional features.

⁴ http://transcripts.cnn.com/TRANSCRIPTS/lkl.html
⁵ We use random seed value zero for the numpy.random package.
Can you? They’re leaving, they’re leaving.
Yes.
When I took the 60, I didn’t sleep for like two days.
I feel like it is doing nothing.
Talk to me and listen to me.
Mm-hm.
Uh-huh.
No, I’m not about to tell him. Hum-um.
It was one of the few things in there that I actually bought from her. None of the things I think... She was trying hard to say the right
things. Where have I ever heard that from? So I leave the conversation and go, ”All right, well, we’ll see what happens.” And what
she asked in return is that when I’m angry at her, tell her. ”I’m angry at you. I’m not going to talk to you for a while now.” She said,
”As long as I know.”
Sure. I’ll see you day after tomorrow.
You’re not going any further in therapy right now because you haven’t decided which way you want to go. It really, to me — okay,
maybe I’m over playing it but it seems like a parting in the way. Which game are you going to play? Which game do you want to be
good at? And you keep pussyfooting saying well, I like a little bit of this and I like a little bit of that.
And I can’t breathe. I-I-I can still breathe, you know.
Right.
You’re not going to church or nothing —
Yeah, it’s basically the, um, the recordings are sort of a $50 subsidy –
Table 3: We show first 15 responses (in raw text format) that are selected greedily for the reference set (S R ) in kernel-RMM hashing model,
out of 76,000 responses from the train set, with Alg. 3.
In the case of kernel-hashing, we use subsequence kernels [Mooney and Bunescu, 2005] for computing similarity between two responses⁶, while similarity between a pair of words is computed using WordNet if their POS tags match.

For the size of the reference set S^R, we try the values 30, 100, 300 (α = 10). The reference set is either initialized randomly as a subset of all the responses in the training set (Random Selection), or the sub-selection is optimized with Alg. 3 (MI LB Optimal), with configurations β = 1000, γ = 100, I = 15. See Tab. 3. We explore two different kernel locality sensitive hashing techniques [Joly and Buisson, 2011; Garg et al., 2018b], referred to as RMM⁷ and RkNN respectively, while considering the latter one as our primary choice for detailed analysis.

For the case of neural-hashing, we use all the responses from the training set as the reference set, of size 76,000; since the reference set is very large, we use random subsets of larger size for constructing hash functions, i.e. α = 50. We use an LSTM model to construct each hash function, trained using the Adam algorithm [Kingma and Ba, 2014], with learning rate 1e-3, amsgrad=True, and l1, l2 regularization coefficients of value 1e-4, and the gradients are computed with back propagation. We initialize a word in a response and its POS tag with a randomly initialized word vector of size 30; for a single time step of processing within the LSTM, word vectors of 10 adjacent words, along with their POS tags, are appended into a vector of size 600; this is required to avoid vanishing gradients, since patient responses can be of length up to 8,000 words in the training dataset. For the H LSTM models serving as neural-hash functions, the same neural architecture, i.e., the same number of layers and units, is used in each model. When optimizing the architecture of the LSTM models with Alg. 2 by maximizing our proposed MI LB (MI LB Optimal), we add layers one by one greedily up to a maximum of 4 layers (L = 4, γ = 1000), and try out different possible numbers of units in each layer, i.e., 4, 8, 16, 32, 64.⁸

Random Forests for Inferring Therapist Hashcodes
From the above model, we obtain hashcodes for patient and therapist responses in the training and test subsets, serving as ground truth for the task of inferring the hashcode of the therapist response given the hashcode of a patient response. For inferring each bit of the therapist code, we train an individual Random Forest (RF) classifier containing 100 decision trees. All of the hundred RF classifiers (since H = 100) share the same features (i.e. the hashcodes of patient responses), though they are trained independently of each other.

⁶ λ = 0.8; subsequences of length up to 8 are matched.
⁷ In this hashing technique, an SVM is learned for each hash function. We use C = 0.1 in the SVM, keeping it highly regularized.
⁸ We keep the number of units small, since α is small.
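A minimal sketch of this per-bit inference setup is shown below: one random forest of 100 trees per therapist hashcode bit, all sharing the patient hashcode as features; the tree count follows the description above, while everything else (function names, shapes) is illustrative rather than the authors' code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_bit_classifiers(C_p_train, C_t_train, n_trees=100, seed=0):
    """Train one RF per therapist hashcode bit (H classifiers in total)."""
    H = C_t_train.shape[1]
    models = []
    for j in range(H):
        rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        rf.fit(C_p_train, C_t_train[:, j])   # same features, independent targets
        models.append(rf)
    return models

def infer_therapist_code(models, c_p):
    """Infer a therapist hashcode from one patient hashcode of shape (H,)."""
    return np.array([m.predict(c_p[None, :])[0] for m in models], dtype=int)
```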
Run away and, and be by myself.
Um-hum.
(i) Uh-huh. (ii) Oh. (iii) That doesn’t pan out.
Yeah, we’ll be in touch about all that stuff.
Okay. So very nice to see you again.
(i) OK, I see. (ii) Kind of like that- (iii) Good, good. What’s new anything ?
Sensitive. Yeah. (inaudible) just like (inaudible) very uncomfortable.
Uh-huh.
(i) What’s going on with the moods? (ii) Big people don’t feel this way like? (iii) Scares you how?
Yeah. It just feels moderately depressing. My head spins
Do you feel depressed?
(i) So eventually it will die out. (ii) So it may peter out later. (iii) We can stop for now.
Well, am I kidding myself when I’m saying nothing’s bothering me? So that bothers me.
When you don’t - whether you can really trust your own feelings.
(i) To be at - all of us. (ii) I’m sure. (iii) That sounds exhausting, yeah.
(laughs) Um, yeah. (laughs)
Forever.
(i) OK, I see. (ii) Okay, I understand. (iii) All right, take it.
But it’s like that’s what I’ve been doing. And I don’t, I can’t scrape up the money to go back to school.
Yeah.
(i) Um-hum. (ii) Mm-hmm. (iii) Hmmm.
Table 4: Here, we present some examples of textual response choices produced with our hashing based approach (raw text format). For a
patient response (Black), we show the ground truth response from a human therapist (Blue), and three textual response choices generated with
kernel-RMM model (Magenta) via mapping of the inferred therapist hashcode to hashcode-indexed therapist responses from the training set.
5.1 Evaluation Results
Evaluation metrics. Our primary evaluation metric is our proposed MI lower bound (MI LB) in Theorem 1. We also use the proposed upper bound on the entropy of therapist hashcodes in Theorem 2 to evaluate the quality of hashcodes as representations of responses (Entropy UB). In addition, we express the ratio of MI LB and Entropy UB as Normalized MI LB (NMI LB). It is also interesting to analyze MI LB for random pairings between patient and therapist responses (denoted as "Shuffled"); the same applies for NMI LB.

Interpreting the accuracy metric (Accuracy), i.e. the accuracy of inferring each hashcode bit using the RF classifiers, requires a more detailed discussion. Since the class labels (therapist hashcodes) are not fixed, the accuracy is inversely correlated with the Entropy UB metric. We also obtain the baseline accuracy (Baseline), using a trivial classifier that always chooses the most-frequent class label. While analyzing the absolute accuracy numbers is not meaningful here, we can see the relative improvement of accuracy w.r.t. the accuracy of the baseline dummy classifier. The mean and standard deviation statistics for each metric are computed over 10 runs of the experiment, as mentioned above; in the case of the accuracy metric, the statistics are computed over all (100) hashcode bits and over 10 trials. For each of the metrics, higher values mean better performance. In addition, by subtracting MI LB from Entropy UB, we obtain the inference quality.

Depression Dataset Results
We present the results for our primary evaluation on the depression dataset in Tab. 2. Overall, our MI LB optimization based approach outperforms the alternatives in most of the settings, for the MI LB as well as Entropy UB metrics (best results shown in boldface). We will first discuss in more detail our results for kernel-based hashing.

Kernel-Hashing. Since the performance of a kernel-hashing model varies for a random initialization of S^R, we perform multiple trials in that case, and report numbers for the trial which gives the median value of MI LB. For RkNN-kernel hashing, we vary the size of the reference set S^R, i.e. M ∈ {30, 100, 300}. For M = 30, the MI LB (-11.5) and Entropy UB (5.7) values are low when selecting the reference set elements randomly (Random Selection). On the other hand, if the reference set is optimized (MI LB Optimal), MI LB (4.4) and Entropy UB (30.1) have much higher values, as desired. For M = 100, MI LB values increase significantly compared to the values for M = 30; thus a high value of M is advantageous. We see the benefits of optimizing the reference set (MI LB: 13.6, Entropy UB: 26.2), compared to the Random Selection case (MI LB: 9.8, Entropy UB: 19.9), for M = 100 too. For a further increase in the size of the reference set, M = 300, we obtain even higher values for MI LB (15.7), though not a significant increase in Entropy UB (24.0), especially when selecting the reference set randomly. In this setting, the advantage of optimizing the reference set is somewhat marginal. For robustness, it is always good to optimize the set rather than use a random selection.

For kernel-RMM, we use M = 100, obtaining similar results as with kernel-RkNN hashing. Here it is interesting to see that the ratio of MI LB (11.4) w.r.t. its baseline of MI LB for shuffled pairs (2.8) is significantly higher (4.1) than for other models. This ratio should serve as another metric, derived from MI LBs, to judge the quality of a dialogue model (a higher value of the ratio is better).

With an increase in M when selecting the reference set randomly, we observe a decrease in baseline accuracy, since the Entropy UB increases in value. Also, when optimizing the reference set instead of Random Selection, Entropy UB increases, and therefore the baseline accuracy decreases. Overall, when the difference between Entropy UB and MI LB (quantifying inference quality) is high, the positive gap between the accuracy numbers and the baseline accuracy numbers is more significant. Variance for accuracy is high because for some bits in the hashcodes, the distribution of 0s and 1s is imbalanced (say, 99%-1%), making the inference very challenging for those bits.

Herein, it is also interesting to observe MI LB values for the "Shuffled" setting (presented in brackets), i.e. computing MI LB for random pairings of patient/therapist responses. This should also serve as a good metric to analyze a hashing based dialogue model. Since we are modeling responses from all therapy sessions put together into a single set, one should expect a good model to have a low value of MI LB for the shuffled setting; though, it is natural to expect a minimal value of mutual information even between randomly paired responses of patient & therapist from different sessions, due to the common theme of depression psychotherapy across the sessions.
Hashing Func. | Hash Config. | Model | MI LB (Shuffled) | Entropy UB | NMI LB (Shuffled) | Accuracy (Baseline)
Kernel-RkNN | M=100, α=10 | Random Sel. | 15.1 ± 0.5 (7.3 ± 0.2) | 22.6 ± 0.4 | 0.67 (0.32) | 87.7 ± 12.9 (81.8 ± 17.3)
Kernel-RkNN | M=100, α=10 | MI LB Opt. | 16.8 ± 0.6 (7.9 ± 0.4) | 25.3 ± 0.6 | 0.66 (0.31) | 86.2 ± 13.8 (79.5 ± 17.2)
Kernel-RkNN | M=300, α=10 | Random Sel. | 20.4 ± 0.4 (11.7 ± 0.2) | 26.9 ± 0.5 | 0.76 (0.43) | 88.3 ± 11.0 (80.7 ± 15.2)
Neural-RLSTM | M=18000, α=50 | MI LB Opt. | 49.5 ± 0.3 (28.0 ± 0.4) | 60.6 ± 0.2 | 0.82 (0.46) | 69.1 ± 7.8 (55.8 ± 8.3)

Table 5: Evaluating the quality of inference of dialogue responses (hashcodes), for the Larry King dataset.
Neural-Hashing
For the optimization of the Neural-RLSTM hashing model, Alg. 2 gives a simple LSTM architecture of [16, 64, 16, 8], i.e. four layers, with 16 LSTM units in the first layer (the bottom layer receiving the inputs), 64 LSTM units in the second layer, and so on. For this model, we obtain relatively higher values for MI LB (17.2), compared to the kernel-hashing above. Besides the optimized architecture for the LSTM models, we also tried manually built architectures; for instance, if using a simple LSTM model of a single layer with 10 units for each hash function, we obtain an MI LB value of 6.7, which is significantly lower compared to the optimized Neural-RLSTM model (17.2), and to many of the kernel-hashing models above.

The above results demonstrate that our approach of optimizing hashing functions by maximizing our novel MI LB leads to higher values for the evaluation metrics as desired, across the kernel- as well as neural-hashing models. There is a trade off between kernel-hashing and neural-hashing. In kernel-hashing, as per the selected responses in the optimized reference set, one can get insights on what patterns are relevant for a dialogue; for example, see Tab. 3. On the other hand, with neural-hashing, one can take advantage of the vast variety of neural language models available.

Further, we can map an inferred hashcode to textual responses using a repository of indexed sentences. See Tab. 4.

Larry King Dataset Results
Some selected results for the Larry King dataset are presented in Tab. 5. Herein, it is interesting to see that we get very high values of MI LB and Entropy UB with our neural hashing approach (the optimized architecture is [8, 32]), in comparison to our kernel hashing. Also, the mean accuracy with neural hashing is significantly higher than the baseline accuracy number. The NMI LB values for the two hashing approaches are, though, relatively close.

6 Conclusions

This paper introduces a novel approach to dialogue modeling based on hash functions, using psychotherapy sessions as a motivating domain. In our framework, responses from both parties (e.g., patient and therapist) are represented by the corresponding hashcodes, capturing certain text patterns. Furthermore, we propose a novel lower bound on Mutual Information in order to characterize the relevance of a therapist's response to the patient's text, and vice versa. Moreover, in order to characterize the general quality of hashcodes as response representations, we propose a tight upper bound on the joint entropy of hashcodes. We performed empirical evaluation of the proposed approach on the dataset containing depression therapy sessions between real patients and therapists. We optimized locality sensitive hashing models, based on kernel functions or neural language models, by maximizing the proposed MI lower bound as an objective function. Our results consistently demonstrate superior performance of the proposed approach over several alternatives, as measured by several evaluation metrics.

References

[Alemi et al., 2017] Alex Alemi, Ian Fischer, Josh Dillon, and Kevin Murphy. Deep variational information bottleneck. 2017.
[Althoff et al., 2016] Tim Althoff, Kevin Clark, and Jure Leskovec. Large-scale analysis of counseling conversations: An application of natural language processing to mental health. Transactions of the Association for Computational Linguistics, 4:463, 2016.
[Arora et al., 2016] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A simple but tough-to-beat baseline for sentence embeddings. 2016.
[Asghar et al., 2017] Nabiha Asghar, Pascal Poupart, Xin Jiang, and Hang Li. Deep active learning for dialogue generation. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), pages 78–83, 2017.
[Barber and Agakov, 2003] David Barber and Felix Agakov. The IM algorithm: a variational approach to information maximization. In Proceedings of the 16th International Conference on Neural Information Processing Systems, pages 201–208. MIT Press, 2003.
[Bartl and Spanakis, 2017] Alexander Bartl and Gerasimos Spanakis. A retrieval-based dialogue system utilizing utterance and context embeddings. arXiv preprint arXiv:1710.05780, 2017.
[Chalk et al., 2016] Matthew Chalk, Olivier Marre, and Gasper Tkacik. Relevant sparse codes with variational information bottleneck. In Advances in Neural Information Processing Systems, pages 1957–1965, 2016.
[Chen et al., 2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
[Fitzpatrick et al., 2017] Kathleen Kara Fitzpatrick, Alison Darcy, and Molly Vierhile. Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): A randomized controlled trial. JMIR Mental Health, 4(2):e19, 2017.
[Gao et al., 2016] Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. Variational information maximization for feature selection. In Advances in Neural Information Processing Systems, pages 487–495, 2016.
[Garg et al., 2018a] Sahil Garg, Aram Galstyan, Greg Ver Steeg, Irina Rish, Guillermo Cecchi, and Shuyang Gao. Efficient representation for natural language processing via kernelized hashcodes. arXiv preprint arXiv:1711.04044, 2018.
[Garg et al., 2018b] Sahil Garg, Greg Ver Steeg, and Aram Galstyan. Stochastic learning of nonstationary kernels for natural language modeling. arXiv preprint arXiv:1801.03911, 2018.
[Haussler, 1999] David Haussler. Convolution kernels on discrete structures. Technical report, 1999.
[He et al., 2017] He He, Anusha Balakrishnan, Mihail Eric, and Percy Liang. Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings. arXiv preprint arXiv:1704.07130, 2017.
[Joly and Buisson, 2011] Alexis Joly and Olivier Buisson. Random maximum margin hashing. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 873–880. IEEE, 2011.
[Jurafsky and Martin, 2014] D Jurafsky and JH Martin. Dialog systems and chatbots. Speech and Language Processing, 3, 2014.
[Kingma and Ba, 2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[Komorowski and Trzcinski, 2017] Michal Komorowski and Tomasz Trzcinski. Random binary trees for approximate nearest neighbour search in binary space. arXiv preprint arXiv:1708.02976, 2017.
[Kosovan et al., ] Sofiia Kosovan, Jens Lehmann, and Asja Fischer. Dialogue response generation using neural networks with attention and background knowledge.
[Kraskov et al., 2004] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Physical Review E, 69:066138, 2004.
[Kraskov et al., 2005] Alexander Kraskov, Harald Stögbauer, Ralph G Andrzejak, and Peter Grassberger. Hierarchical clustering using mutual information. EPL (Europhysics Letters), 70(2):278, 2005.
[Kulis and Grauman, 2009] Brian Kulis and Kristen Grauman. Kernelized locality-sensitive hashing for scalable image search. In Computer Vision, 2009 IEEE 12th International Conference on, pages 2130–2137. IEEE, 2009.
[Kulis and Grauman, 2012] Brian Kulis and Kristen Grauman. Kernelized locality-sensitive hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(6):1092–1104, 2012.
[Lewinsohn et al., 1990] Peter M Lewinsohn, Gregory N Clarke, Hyman Hops, and Judy Andrews. Cognitive-behavioral treatment for depressed adolescents. Behavior Therapy, 21(4):385–401, 1990.
[Lewis et al., 2017] Mike Lewis, Denis Yarats, Yann N Dauphin, Devi Parikh, and Dhruv Batra. Deal or no deal? End-to-end learning for negotiation dialogues. arXiv preprint arXiv:1706.05125, 2017.
[Li and Jurafsky, 2016] Jiwei Li and Dan Jurafsky. Neural net models for open-domain discourse coherence. arXiv preprint arXiv:1606.01545, 2016.
[Li et al., 2015] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055, 2015.
[Li et al., 2017] Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547, 2017.
[Liu et al., 2016] Chia-Wei Liu, Ryan Lowe, Iulian V Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023, 2016.
[Lowe et al., 2017] Ryan Lowe, Michael Noseworthy, Iulian V Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. Towards an automatic Turing test: Learning to evaluate dialogue responses. arXiv preprint arXiv:1708.07149, 2017.
[Mooney and Bunescu, 2005] Raymond J Mooney and Razvan C Bunescu. Subsequence kernels for relation extraction. In Proc. of NIPS, pages 171–178, 2005.
[Norouzi et al., 2014] Mohammad Norouzi, Ali Punjani, and David J Fleet. Fast exact search in Hamming space with multi-index hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1107–1119, 2014.
[Papineni et al., 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002.
[Serban et al., 2015] Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. A survey of available corpora for building data-driven dialogue systems. arXiv preprint arXiv:1512.05742, 2015.
[Serban et al., 2016] Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, and Joelle Pineau. Generative deep neural networks for dialogue: A short review. arXiv preprint arXiv:1611.06216, 2016.
[Serban et al., 2017a] Iulian Vlad Serban, Tim Klinger, Gerald Tesauro, Kartik Talamadupula, Bowen Zhou, Yoshua Bengio, and Aaron C Courville. Multiresolution recurrent neural networks: An application to dialogue response generation. In AAAI, pages 3288–3294, 2017.
[Serban et al., 2017b] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C Courville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pages 3295–3301, 2017.
[Shao et al., 2017] Yuanlong Shao, Stephan Gouws, Denny Britz, Anna Goldie, Brian Strope, and Ray Kurzweil. Generating high-quality and informative conversation responses with sequence-to-sequence models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2200–2209, 2017.
[Studenỳ and Vejnarová, 1998] Milan Studenỳ and Jirina Vejnarová. The multiinformation function as a tool for measuring stochastic dependence. In Learning in Graphical Models, pages 261–297. Springer, 1998.
[Ver Steeg and Galstyan, 2014] Greg Ver Steeg and Aram Galstyan. Discovering structure in high-dimensional data through correlation explanation. In Advances in Neural Information Processing Systems 27, 2014.
[Wang et al., 2014] Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. Hashing for similarity search: A survey. arXiv preprint arXiv:1408.2927, 2014.
[Wang et al., 2017] Jingdong Wang, Ting Zhang, Nicu Sebe, Heng Tao Shen, et al. A survey on learning to hash. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[Watanabe, 1960] Satosi Watanabe. Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4(1):66–82, 1960.
[Wen et al., 2016] Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562, 2016.
[Wieting et al., 2015] John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. Towards universal paraphrastic sentence embeddings. arXiv preprint arXiv:1511.08198, 2015.
[Wu et al., 2017] Yu Wu, Wei Wu, Chen Xing, Ming Zhou, and Zhoujun Li. Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 496–505, 2017.
[Zhai and Williams, 2014] Ke Zhai and Jason D Williams. Discovering latent structure in task-oriented dialogues. In ACL (1), pages 36–46, 2014.

A Derivations of the Information Theoretic Bounds

Before discussing our novel lower bound, we introduce the information-theoretic quantity called Total Correlation (TC), which captures the non-linear correlation among the dimensions of a random variable X:

TC(X; M_h) = \sum_j H(X(j); M_h) - H(X; M_h),
TC(X : Y; M_h) = TC(X; M_h) - TC(X | Y; M_h).    (5)

Intuitively, (5) describes the amount of information within X that can be explained by Y. Along these lines, the mutual information between the hashcodes can be decomposed as in Lemma 1 below.

Lemma 1 (Mutual Information Decomposition). The mutual information between C_t and C_p decomposes as follows:

I(C_t : C_p; M_h) = \sum_j I(C_t(j) : C_p; M_h) - TC(C_t : C_p; M_h).    (6)

The first term on the RHS of (6) is a sum of mutual information terms, each between a one-dimensional and a multi-dimensional random variable. Since one of the variables in each term is only one-dimensional, we can approximate these terms with the existing technique of variational bounds [Gao et al., 2016], as in Lemma 2 below.

Lemma 2. The marginal mutual information for each bit of the therapist hashcodes, I(C_t(j) : C_p; M_h), is lower bounded as

I(C_t(j) : C_p; M_h) \geq H(C_t(j); M_h) + \langle \log q(C_t(j) | C_p; M_h) \rangle_{p(C_t(j), C_p; M_h)}.    (7)

Herein, H(C_t(j); M_h) is easy to compute because C_t(j) is one-dimensional. For each of the proposal distributions q(C_t(j) | C_p; M_h), we propose to use a Random Forest (RF) classifier [Garg et al., 2018a].

Regarding the second term on the RHS of (6), it is computationally intractable to compute the total correlation expression TC(C_t : C_p; M_h), which denotes the total correlation between the bits of C_t that is explainable by C_p. We would therefore like an upper bound on TC(C_t : C_p; M_h) that is cheap to compute; because of the negative sign, this yields a lower bound for the second term in (6).

Lemma 3. TC(C_t : C_p; M_h) can be upper bounded as:

TC(C_t : C_p; M_h) \leq TC(C_t : Y^*; M_h),  Y^* = \arg\max_{Y : |Y| \leq |C_p|} TC(C_t : Y; M_h),    (8)

wherein |\cdot| denotes the dimensionality of a random variable.

Although it is intractable to compute the original term TC(C_t : C_p; M_h), it is possible to compute TC(C_t : Y^*; M_h) for a latent variable representation Y^* of C_t that maximally explains the total correlation in C_t. We can think of the computation of this upper bound as an unsupervised learning problem; we propose to use an existing algorithm, CorEx, for the unsupervised learning of the latent random variable representation Y^* [Ver Steeg and Galstyan, 2014].
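To make the quantities in (5) concrete, the following minimal Python sketch (our own illustration, not the implementation used in this work) estimates the total correlation of sampled binary hashcodes with simple plug-in entropy estimates; the array C_t stands for sampled therapist hashcodes, Y for a candidate latent representation, and all names are illustrative.

import numpy as np
from collections import Counter

def plugin_entropy(rows):
    # Plug-in (empirical) entropy, in nats, of the joint distribution
    # over the rows of a 2-D array of discrete values.
    counts = Counter(map(tuple, rows))
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log(p))

def total_correlation(C):
    # TC(C) = sum_j H(C(j)) - H(C), as in definition (5).
    marginals = sum(plugin_entropy(C[:, [j]]) for j in range(C.shape[1]))
    return marginals - plugin_entropy(C)

def explained_tc(C, Y):
    # TC(C : Y) = TC(C) - TC(C | Y); the conditional term is estimated by
    # averaging the within-group TC over groups of samples sharing the same Y.
    tc_cond = 0.0
    for y in set(map(tuple, Y)):
        mask = np.all(Y == np.array(y), axis=1)
        tc_cond += mask.mean() * total_correlation(C[mask])
    return total_correlation(C) - tc_cond

# Toy usage with random 8-bit "hashcodes" and a 2-bit latent representation.
rng = np.random.RandomState(0)
C_t = rng.randint(0, 2, size=(1000, 8))
Y = C_t[:, :2]  # a (trivially informative) candidate latent representation
print(total_correlation(C_t), explained_tc(C_t, Y))

This brute-force estimator is only meant to make the definitions concrete; CorEx searches for a low-dimensional Y that maximizes the explained total correlation far more efficiently than any such exhaustive computation.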
It is important to note some practical considerations about the upper bound. If the maximization of TC(C_t : Y; M_h) above is solved only suboptimally, the optimized quantity may not be a true upper bound on TC(C_t : C_p; M_h), but rather an approximation of it. Also, the upper bound is not tight if C_p does not explain much of the total correlation in C_t. Further, to reduce the computational cost of learning, the dimension of the latent representation Y can be kept much smaller than the dimension of the hashcodes, i.e., |Y| ≪ |C_p| for |C_p| ≫ 1; this is because even a small number of latent variables should explain most of the total correlation for practical purposes, as demonstrated in [Ver Steeg and Galstyan, 2014] and also observed in our experiments on hashcodes.
Combining (7) and (8) into (6), we get the lower bound in
(3) in Theorem 1.
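Writing this combination out explicitly (as a reading aid; the formal statement is (3) in Theorem 1 of the main text):

I(C_t : C_p; M_h) \geq \sum_j \big[ H(C_t(j); M_h) + \langle \log q(C_t(j) | C_p; M_h) \rangle_{p(C_t(j), C_p; M_h)} \big] - TC(C_t : Y^*; M_h).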
Along the same lines, we derive the tight upper bound on the joint entropy of hashcodes in Theorem 2. From the definition of Total Correlation in (5), we have

\sum_j H(C_t(j); M_h) - TC(C_t; M_h) = H(C_t; M_h),
TC(C_t; M_h) = TC(C_t : Y^*; M_h) + TC(C_t | Y^*; M_h),

and therefore

\sum_j H(C_t(j); M_h) - TC(C_t : Y^*; M_h) = H(C_t; M_h) + TC(C_t | Y^*; M_h).

From this derived expression, we can directly read off the upper bound and the corresponding gap.
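Spelling this out: since TC(C_t | Y^*; M_h) \geq 0, the identity above immediately gives

H(C_t; M_h) \leq \sum_j H(C_t(j); M_h) - TC(C_t : Y^*; M_h),

with the gap of the bound equal to TC(C_t | Y^*; M_h), i.e., the residual total correlation in C_t left unexplained by Y^*.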
Previous Lower Bounds for Mutual Information: Variational lower bounds on the mutual information criterion have been proposed in the past [Barber and Agakov, 2003; Chalk et al., 2016; Gao et al., 2016; Chen et al., 2016; Alemi et al., 2017; Garg et al., 2018a]. These lower bounds work only when one of the variables is fixed, say if C_t were fixed. In our objective, not only is C_t a functional of the hashing model that we are learning, it is also high dimensional. Unless we also have a lower bound for the entropy term H(C_t; M_h), which appears hard to obtain, we cannot use the above-mentioned variational lower bounds for our problem as such. Besides, it is also non-trivial to find an appropriate proposal distribution q(C_t | C_p; M_h). Therefore, we adopt a different approach to obtain a novel lower bound on the mutual information, as described above.
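The per-bit proposal distributions q(C_t(j) | C_p; M_h) needed in (7), by contrast, are ordinary binary classifiers. As a concrete illustration (our own sketch, assuming scikit-learn; not the authors' implementation, and all data and names are illustrative), the snippet below fits a Random Forest for one bit j and estimates the sample average of log q(C_t(j) | C_p) on held-out pairs; adding a plug-in estimate of H(C_t(j); M_h) would give the per-bit lower bound on the right-hand side of (7).

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def per_bit_bound_term(C_p, C_t_bit, n_estimators=100, seed=0):
    # Fit a Random Forest proposal q(C_t(j) | C_p) for a single hashcode bit
    # and return the held-out estimate of <log q(C_t(j) | C_p)> from Eq. (7).
    Xtr, Xte, ytr, yte = train_test_split(
        C_p, C_t_bit, test_size=0.3, random_state=seed)
    rf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    rf.fit(Xtr, ytr)
    proba = rf.predict_proba(Xte)            # shape: (n_test, n_classes)
    idx = np.searchsorted(rf.classes_, yte)  # column of the true class per sample
    eps = 1e-12                              # avoid log(0) for confident mistakes
    return np.mean(np.log(proba[np.arange(len(yte)), idx] + eps))

# Toy usage: random patient hashcodes C_p and one therapist bit C_t(j).
rng = np.random.RandomState(0)
C_p = rng.randint(0, 2, size=(2000, 64))
C_t_bit = C_p[:, 0] ^ rng.binomial(1, 0.1, size=2000)  # noisy copy of one patient bit
print(per_bit_bound_term(C_p, C_t_bit))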