<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dialogue Modeling Via Hash Functions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sahil Garg</string-name>
          <email>sahil.garg.cs@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guillermo Cecchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Irina Rish</string-name>
          <email>rishg@us.ibm.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuyang Gao</string-name>
          <email>sgao@isi.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Greg Ver Steeg</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sarik Ghazarian</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Palash Goyal</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aram Galstyan</string-name>
          <email>galstyang@isi.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IBM Thomas J. Watson Research Center</institution>
          ,
          <addr-line>Yorktown Heights, NY</addr-line>
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>USC Information Sciences Institute</institution>
          ,
          <addr-line>Marina del Rey, CA</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>24</fpage>
      <lpage>36</lpage>
      <abstract>
        <p>We propose a novel machine-learning framework for dialogue modeling which uses representations based on hash functions. More specifically, each person's response is represented by a binary hashcode where each bit reflects presence or absence of a certain text pattern in the response. Hashcodes serve as compressed text representations, allowing for efficient similarity search. Moreover, the hashcode of one person's response can be used as a feature vector for predicting the hashcode representing another person's response. The proposed hashing model of dialogue is obtained by maximizing a novel lower bound on the mutual information between the hashcodes of consecutive responses. We apply our approach in the psychotherapy domain, evaluating its effectiveness on a real-life dataset consisting of therapy sessions with patients suffering from depression; in addition, we also model transcripts of interview sessions between Larry King (television host) and his guests.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Dialogue modeling and generation is an area of active
research, and of great practical importance, as it provides a
basis for building successful conversational agents in a wide
range of applications. While an open-domain dialogue
remains a challenging open problem, developing dialogue
systems for particular applications can be more tractable, due to
specific properties of the application domain.</p>
      <p>A motivating application for our work is (semi-)automated
psychotherapy: easily accessible, round-the-clock
psychotherapeutic services provided by a conversational agent.
According to recent estimates, mental health disorders
affect one in four adult Americans, one in six adolescents, and
one in eight children. Furtermore, as predicted by the World
Health Organization, by 2030 the amount of worldwide
disability and life loss attributable to depression may become
greater than for any other condition, including cancer, stroke,
heart disease, accidents, and war. However, many people do
not receive an adequate treatment; one of the major factors
here is limited availability of mental health professionals, as
compared to the number of potential patients.</p>
      <p>Table 1: An example pair of patient/therapist responses from a depression-therapy session, with their binary hashcode representations.
Patient: Its all my fault. It's like getting hit in the stomach and trying to catch your breath. You can't catch it. You struggle, move around, you just can't get your breath. That just really knocks me off balance. You begin to generate this feeling of a kind of negativism. And that's why it really being hurt. I never allowed myself to think negative, accept everybody else. I am the only one giving back to the "be quiet time." I had to have something to make do with that.
Hashcode: 0101001111001011111110101101011110011011100011101
Therapist: OK. Maybe that fits into the whole pattern then.
Hashcode: 0111001101001011111100011111111110001010110011001</p>
      <p>Thus,
creating automated systems providing mental health services is
becoming an increasingly important area of research and
development; however, existing conversational systems in this
domain are still primarily rule-based, and lack sophisticated
dialogue modeling and generation capabilities based on
actual text understanding.</p>
      <p>An interesting feature of therapeutic dialogues appears to
be a classical pattern of a relatively long monologue of a
patient followed by a considerably shorter response of a
therapist; a typical example is shown in Tab. 1 (taken from
a dataset we experimented with, containing multiple
transcribed depression-therapy sessions). Moreover, a therapist's
responses tend to be rather high-level, generic statements,
typically summarizing/confirming the patient’s response, and
can be viewed as a “category label” being “assigned” to a
patient’s text “sample”. Generating such a short, more
constrained response can be a somewhat easier task than solving
the open-domain dialogue challenge. Note that, rather than
generating a specific response sentence directly, a more
effective approach can be to first infer a conceptual, high-level
representation reflecting the essence of the “most relevant”
response to the patient.</p>
      <p>Motivated by above considerations, we introduce here a
novel dialogue modeling framework where responses are
represented as locality-sensitive binary hashcodes [Kulis and
Grauman, 2009; Joly and Buisson, 2011; Garg et al., 2018b].
The motivation behind such approach includes several
considerations: (a) it can be easier to infer a generalized,
interpretable representation (e.g., binary hascode) of a response
(along with the inverse mapping) instead of a specific textual
response (due to potentially smaller search space); (b)
hashcodes of patient responses can be used as feature vectors to
infer the hashcodes of therapist responses (multiple binary
class labels), as previoiusly demonstrated for an
informationextraction task [Garg et al., 2018a]; (c) an inferred therapist’s
hashcode can be efficiently mapped to multiple text sentences
(possible choices for the therapist response). See Tab. 1.</p>
      <p>While the above approach was motivated by the psychotherapy
domain, it is generally applicable to a wide range of other
domains involving dialogue modeling and generation. The key
contributions of this work include:
(1) a novel, generic framework for modeling a dialogue
using locality sensitive hash functions (based on kernel
similarities or neural networks, as discussed later) to
represent textual responses;
(2) a novel lower bound on the Mutual Information (MI)
between the hashcodes of the responses from the two
agents (capturing relevance between the two), used as
an optimization criterion to optimize a hashing model;
(3) an approach using the MI lower bound as a metric for
joint evaluation of the representation quality of
hashcodes and the quality of inference; also, a tight upper
bound on joint entropy is derived, to separately
characterize the quality of hashcodes as representations.
Empirical evaluation of the proposed framework and the
optimization approach, which uses both kernels and neural
networks for locality sensitive hash functions, is performed on
a dataset containing textual transcriptions of 400
depression-therapy sessions conducted by human therapists with real
patients. We find that our approach of optimizing hashing by
maximizing our novel MI lower bound provides a consistent
improvement over alternative types of hashing approaches
(using both kernels and neural nets), in terms of the proposed
evaluation metrics. Further, the general applicability of our work
is demonstrated by experiments on another dataset of
transcriptions of 75 interview sessions between television host
Larry King and his guests.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Dialogue agents, e.g. chat bots, are becoming increasingly
popular in various applications, including (semi)-automated
psychotherapy, for example, based on popular techniques
such as CBT
        <xref ref-type="bibr" rid="ref24">(Cognitive Behavioral Therapy [Lewinsohn et
al., 1990])</xref>
        ; however, these agents have very limited
abilities to actually understand free text responses from their
users; instead, they are typically offering a fixed set of
preprogrammed responses to choose from [Fitzpatrick et al.,
2017]. See [Jurafsky and Martin, 2014] for an overview.
      </p>
      <p>Several neural network based approaches have been
proposed recently for dialogue modeling in the general
domain [Serban et al., 2015; 2016; 2017a; 2017b; Shao et
al., 2017; Asghar et al., 2017; Wu et al., 2017]. However,
the setting is somewhat different from therapy dialogues,
where patient responses can be extremely long (up to tens
of thousands of words). Also, evaluating the effectiveness of
the therapist’s response requires some notion of relevance
which goes beyond the standard measures of its semantic
features [Papineni et al., 2002; Liu et al., 2016; Li and Jurafsky,
2016; Lowe et al., 2017; Li et al., 2017]; we consider here an
information-theoretic approach to capture this notion. Also,
therapy dialogue modeling and generation has some
similarities with the task-driven dialogue [Zhai and Williams, 2014;
Wen et al., 2016; Althoff et al., 2016; Lewis et al., 2017;
He et al., 2017], although evaluating effectiveness of a
therapeutic dialogue may be more challenging as the effect is not
always immediate. Attention to specific parts of the response,
as well as background knowledge, explored in neural network
based dialogue modeling [Kosovan et al., ], can be quite
helpful in therapeutic dialogues; those aspects are, to some extent,
implicitly captured by learning the hash models. Note that in
related work by [Bartl and Spanakis, 2017], hashcodes are
not treated as representations of the responses, and used only
for the nearest neighbor search, unlike the approach proposed
here.
</p>
      <p>Algorithm 1: Generating locality sensitive hashcodes via randomized binary classifiers.
Require: Reference set of structures SR, of size M; sub-sampling parameter, α, for randomizing a hash function; the number of hash functions, H. % Random subsets, fixed across structure inputs
1: z ← (0, …, 0, 1, …, 1) % α zeros and α ones as artificial labels for the hash code bit
2: for j = 1 → H do
3:   rjh ← randomSubset(M, 2α) % Random sub-sampling of structures for building a randomized locality sensitive hash function
4:   Hj ← TrainBinaryClassifierAsHashFunction(SR(rjh), z) % Train a binary classifier, on the random subset of structures SR(rjh) with their randomly assigned hash labels z, either using a convolution kernel similarity (subsequence/path/graph kernels) based classification model, such as SVM or kNN, or a neural language model, such as an LSTM or GRU, serving as a binary hash function
5: end for</p>
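      <p>For concreteness, the kernelized variant of Alg. 1 can be sketched roughly as below, assuming a precomputed convolution-kernel similarity matrix K over the reference set SR; the function and variable names are illustrative, not from an existing implementation.</p>
      <preformat>
# A minimal sketch of Alg. 1 (kernelized variant), assuming K is an
# (M, M) numpy array of kernel similarities over the reference set S_R.
import numpy as np
from sklearn.svm import SVC

def train_hash_functions(K, H=100, alpha=10, seed=0):
    """Learn H randomized binary hash functions from kernel similarities."""
    rng = np.random.default_rng(seed)
    M = K.shape[0]
    z = np.array([0] * alpha + [1] * alpha)  # artificial hash-bit labels
    models = []
    for _ in range(H):
        idx = rng.choice(M, size=2 * alpha, replace=False)  # random subset of S_R
        # A highly regularized SVM on the sampled structures; its decision
        # boundary acts as one randomized locality sensitive hash function.
        clf = SVC(kernel="precomputed", C=0.1)
        clf.fit(K[np.ix_(idx, idx)], z)
        models.append((idx, clf))
    return models

def hashcode(models, k_row):
    """k_row[i] = kernel similarity of a new structure to reference element i."""
    return np.array([clf.predict(k_row[idx][None, :])[0] for idx, clf in models])
      </preformat>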
      <p>The idea of maximizing a mutual information objective for
dialogue modeling has been considered previously by [Li et
al., 2015], though it was only used at test time, rather than
as a guide for training.</p>
      <p>An evaluation metric such as the BLEU score [Papineni et al.,
2002] may not be the best for our application, since it tries to
capture all information in text when comparing an inference w.r.t.
the ground truth, rather than evaluate the relevance of one
response to the other. We propose several evaluation metrics
based on binary hashcodes themselves, since the latter are
assumed to serve as efficient representations of text in our
dialogue context.</p>
    </sec>
    <sec id="sec-3">
      <title>Dialogue Modeling via Binary Hashing</title>
      <p>Here we discuss our novel framework for modeling a
dialogue between two agents using binary hash functions. In our
formulation below, we refer to the two agents as a patient and
a therapist, respectively, although the approach can be clearly
applied to a variety of other dialogue settings.</p>
      <sec id="sec-3-1">
        <title>Problem Formulation</title>
        <p>In a dialogue session between a patient and a therapist, we
denote the ith response from the patient as Sip, and the
corresponding response from the therapist as Sit; in a session,
we have Spt = {(Sip, Sit)}i=1,…,N, with additional notations for
sets, S = {S1p, …, SNp, S1t, …, SNt}, Sp = {S1p, …, SNp},
St = {S1t, …, SNt}. In this work, we consider a response
Si, be it from the patient or the therapist in a session, as a
natural language structure which can be simply plain text,
or text along with part of speech (PoS) tags, or a
syntactic/semantic parse of the text. As per a typical dialogue
modeling setup, the task would be to predict Sit given Sip,
i.e. generating therapist responses. However, we propose a
different setup. We propose to encode each response of a
patient/therapist (Sip/Sit) in a therapy session as a binary
hashcode (cip/cit ∈ {0, 1}H), and focus upon the problem of
inferring the therapist hashcode (ĉit) given the hashcode (cip) of
the patient response (Sip), before finally mapping the inferred
therapist hashcode (ĉit) to multiple text response choices.
Note that cit is the hashcode representation of the ground
truth therapist response Sit (so it is the ground truth itself),
whereas ĉit is the inference given only the hashcode cip, with
no knowledge of Sit; all the hashcodes
are generated using the same hashing model Mh. See Tab. 1
for examples of pairs of patient/therapist responses and their
corresponding hashcode representations.</p>
        <p>Note that, herein, we keep the problem setting simple,
considering only the previous response of the patient when making
an inference of the therapist response, although the proposed
approach of modeling responses via hash functions is generic
enough for more sophisticated dialogue modeling setups, such
as ones using reinforcement learning.</p>
        <p>A hashcode for a dialogue response can be thought of as
a generalized representation of the response, such that the
binary bit values in the hashcode denote the existence of
important patterns in the response which are relevant for the
therapeutic dialogue between the patient and the therapist, whereas
the rest of the response is effectively noise that should be
irrelevant from a psychotherapy perspective. An additional
advantage of modeling responses as hashcodes is that we can
map a binary hashcode to multiple text responses efficiently,
i.e. generating therapist responses as per the inferred therapist
hashcode.</p>
        <p>Next, we describe our approach for generating locality
sensitive hashcodes of dialogue responses, using either kernel
similarity functions or neural models such as LSTMs.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Locality Sensitive Hashcodes of Responses Using Kernel Functions or Neural Language Models</title>
        <p>The main idea behind locality sensitive hashing is that data
points that are similar to each other should be assigned
hashcodes with minimal Hamming distance to each other, and vice
versa. A similarity/distance function need not be defined
explicitly; see [Wang et al., 2014] for details.</p>
        <p>Since locality sensitive hashing ensures that natural
language structures that are assigned hashcodes with low
Hamming distance to each other are similar to each other,
locality sensitive hashcodes should serve as generalized
representations of language structures (with a similarity/distance function
implied by the locality sensitive hashing model learned for
a given task)¹, and likewise for the responses in a dialogue.</p>
        <p>Many hash functions have been proven to be locality
sensitive, with rigorous mathematical proofs [Wang et al., 2017].
However, the literature on data-driven locality sensitive hash
functions is recent, such as those based on kernel similarity
functions [Kulis and Grauman, 2009; Joly and Buisson, 2011;
Garg et al., 2018b]; see Fig. 1 for a demonstration of
locality sensitive hashing of dialogue responses. Although these
works lack theoretical guarantees for the locality sensitivity of
hash functions, the common intuitive principle in these
approaches is to learn a randomized binary hash function,
constructed using a very small subset of the available data points.
For instance, in [Joly and Buisson, 2011], a kernelized hash
function is built by learning a (random) maximum margin
boundary (SVM) that discriminates between the samples of
two small random subsets of a super set; the same principle is
followed in [Garg et al., 2018b] for constructing a random
hash function, i.e. obtaining a k-Nearest Neighbors based
discrimination boundary between the two random subsets using
kernel functions, instead of a maximum margin boundary. In
[Kulis and Grauman, 2009; 2012], a hash function is built
using the union of two random subsets, that is, an
(approximately) random linear hyperplane in the kernel-implied
feature space. Despite the lack of theoretical guarantees for the
locality sensitivity of kernel based hash functions, the principle
of constructing each random hash function by learning it on a
very small random subset of samples works well in practice.</p>
        <p>¹A similarity function doesn't imply a semantic similarity
function here. For instance, as per a learned locality sensitive
hashing model, the implied (valid) similarity function may account for
matching of only certain patterns in textual responses.</p>
        <p>Considering the commonality in these hashing methods
due to the principle of random sub-sampling, we present a
generalization in Alg. 1 for generating locality sensitive
hashcodes that is applicable for building a random hash
function operating on language structures, either (i) using a
convolution kernel, K(Si, Sj), defining the similarity between two
structures Si, Sj [Mooney and Bunescu, 2005; Collins and
Duffy, 2002; Haussler, 1999], as proposed in the previous
works [Kulis and Grauman, 2009; Joly and Buisson, 2011;
Garg et al., 2018b], or, as we additionally propose in this
paper, (ii) using a neural language model such as an LSTM, GRU,
etc.</p>
        <p>In Alg. 1, the pseudo code is generically applicable for
obtaining the hashcode of any language structure, including
responses from a dialogue. We have a set of language
structures, SR of size M, with no class labels. For building the jth
hash function, we take a random subset SR(rjh) of size 2α,
s.t. 1 ≤ α ≪ M, from the set SR; this subset is partitioned
into two subsets, each of size α, through the assignment of
artificially generated hash bit labels z. Having the two small
subsets, we can learn a binary classifier (preferably
regularized), based on kernel similarities or a neural language model,
that discriminates between the two subsets, acting as a hash
function. Following this principle, we can learn H
hash functions, with each one taking a constant number of
computations for learning, since the size of the subset SR(rjh) is
a small constant, 2α ≪ M.</p>
        <p>When building the hash functions using kernel similarity
functions, the size M of the reference set SR should be small.
This is because generation of a kernelized hashcode for an
input structure Si using the locality sensitive hashing
algorithm requires computing convolution kernel similarities of
Si with all the elements of the reference set SR. A hashcode
ci for a structure Si represents finding important
substructures in it related to the set SR. This means, when
generating hashcodes of dialogue responses as representations in our
framework, we can control which patterns should be attended
to by the therapist in the patient responses, through the
optimization of the reference set itself. This idea can be very powerful
when modeling dialogues involving long responses from
patients. In Sec. 4, we propose to optimize a reference set using
a novel algorithm.²</p>
        <p>On the other hand, when constructing a parametric hash
function by learning a binary classifier based on a neural
language model as proposed above, we don’t need to restrict the
size of the reference set, SR. Instead, we propose to optimize
the architecture of all the H neural models,
corresponding to the H hash functions, jointly, with our novel
algorithm in Sec. 4, while learning the weight parameters of the
neural models independently of each other.</p>
        <p>²Note that the similarity function, implied from a kernel locality
sensitive hashing model, may not be the same as the kernel function.
While the kernel function may be accounting for similarity of all
possible patterns in textual responses, the similarity function would
be effectively matching only certain patterns, defined as per the
reference set in the hashing model.</p>
        <p>
While we propose to represent patient/therapist responses as
hashcodes, and infer the hashcode for a therapist response
instead of directly inferring the textual response, we can map
an inferred therapist hashcode to multiple choices for the
final textual response. As mentioned previously, one of the
main advantages of using binary hash functions to build
representations of responses is that one can use the same binary
hashing model for indexing a repository of sentences. For
the task of building an automated therapist (assisting a
human therapist), from a training set of therapy sessions, we
can put all the responses from therapists into a single
repository. Then, having a hashing model as per Alg. 1, all the
responses can be indexed with binary hashcodes. Now, after
inferring a therapist hashcode, we can search in the repository
of hashcode-indexed responses to find multiple responses
having minimal Hamming distance w.r.t. the inferred
hashcode. Even if the number of hashcode bits were large,
sub-linear time algorithms could be used for efficient similarity search
(nearest neighbors) in Hamming spaces [Norouzi et al., 2014;
Komorowski and Trzcinski, 2017]. For examples, see Tab. 4.</p>
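      <p>As a brief illustration of this retrieval step, a brute-force version can be sketched as below, assuming hashcodes are stored as 0/1 numpy arrays; the sub-linear-time methods cited above would replace the sort.</p>
      <preformat>
# A minimal sketch: retrieve the k indexed therapist responses whose
# hashcodes are closest in Hamming distance to an inferred hashcode.
import numpy as np

def nearest_responses(inferred_code, indexed_codes, k=3):
    # indexed_codes: (num_responses, H) array; inferred_code: (H,) array
    hamming = np.count_nonzero(indexed_codes != inferred_code, axis=1)
    return np.argsort(hamming)[:k]  # indices into the response repository
      </preformat>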
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Learning Hashing for Dialogue Modeling with Mutual Information Maximization</title>
      <p>In the previous section, we introduced a framework for
modeling dialogues via hash functions, especially in the context of
a therapeutic dialogue between a therapist and a patient. The
question is what objective function to use for the optimization
of the hashing model. In the following, we introduce an
objective function that is suitable for optimizing hash functions
for dialogue modeling. The two main criteria for selecting
this objective function are as follows: (1) it should
characterize the quality of hashcodes as generalized representations
of dialogue responses; (2) it should also account for the
inference of therapist hashcodes, i.e. therapist dialogue generation.
Although the primary motivation of the present work is
therapeutic dialogue modeling, we note that the above criteria
apply to more general dialogue modeling problems as well.</p>
      <sec id="sec-4-1">
        <title>Objective Function Formulation to Optimize Hashing for Dialogue Modeling</title>
        <p>We propose that maximizing the Mutual Information between the
hashcodes of patient responses and the corresponding therapist
responses is suitable for optimizing both the inference
accuracy and the representation quality.</p>
        <p>arg max_{Mh} I(Cp : Ct; Mh)   (1)</p>
        <p>I(Cp : Ct; Mh) = H(Ct; Mh) − H(Ct|Cp; Mh),   (2)
where the first term on the right characterizes the representation quality, and the second (conditional entropy) term characterizes the inference quality.</p>
        <p>Algorithm 2: Optimizing the neural architecture in a neural-hashing model for dialogue modeling by maximizing our MI LB.
Require: Train sets, Sp, St; maximum number of layers in the neural language models, L; the number of samples for computing the MI lower bound, β; candidate values for the units in a layer, u = {5, 10, 20, 40, None}.
% optimizing up to L layers greedily
1: for j = 1 → L do
2:   rlb ← randomSubset(N, β) % subset of patient/therapist response pairs for computing the MI LB to optimize the jth layer
   % number of units in the jth layer; None is not applicable for the 1st layer
3:   for l ∈ u do
4:     n(j) ← l % l units in the jth layer of the neural network
5:     Cp ← computeHashcodes(Sp(rlb), n)
6:     Ct ← computeHashcodes(St(rlb), n)
7:     milb(l) ← computeMILowerBound(Cp, Ct)
8:   end for
9:   n(j) ← maxMILBIndex(milb, u) % choose the number of units with the maximum value of MI LB
10:  if n(j) is None then break out of loop end if
11: end for
12: return n</p>
        <p>
Herein, Mh denotes a hashing model, such as the one
described above; Cp is the distribution on the hashcodes of
patient dialogue responses, and Ct characterizes the distribution
on the hashcodes of the corresponding therapist responses.
While minimizing the conditional entropy, H(Ct|Cp; Mh),
improves the accuracy of inferring the therapist
response hashcodes, maximizing the entropy term, H(Ct; Mh),
should ensure good quality of the hashcodes as generalized
representations. See Fig. 2 for an illustration.</p>
      <p>Computing mutual information between two
high-dimensional variables can be expensive, and potentially
inaccurate if the number of samples is small [Kraskov et al., 2004].
So, we propose to optimize the hashing model by
maximizing a lower bound on the mutual information criterion.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Information Theoretic Bounds for Optimizing and Evaluating Hashing for Dialogue Modeling</title>
        <p>We develop a novel lower bound on the mutual
information criterion that is cheaper and more accurate to compute
than the criterion itself.</p>
        <p>Theorem 1 (Lower Bound for Mutual Information).
Mutual information between two hashcode distributions, I(Ct :
Cp; Mh), is lower bounded as,
I(Ct : Cp; Mh) ≥ Σj H(Ct(j); Mh) − TC(Ct : Y; Mh) + Σj ⟨log q(Ct(j)|Cp; Mh)⟩p(Ct(j),Cp;Mh).   (3)
Herein, TC(Ct : Y; Mh) describes the amount of Total
Correlation (multi-variate Mutual Information)³ within Ct that
can be explained by a latent variable representation Y;
q(Ct(j)|Cp; Mh) is a proposal conditional distribution for
the jth bit of therapist hashcodes, built using a classifier, like
a random forest, neural network, etc.</p>
        <p>³The "total correlation" quantity, also called multi-variate
Mutual Information, or multi-information, was defined in [Watanabe,
1960; Studený and Vejnarová, 1998; Kraskov et al., 2005].</p>
        <p>[Algorithm 3: Optimizing the reference set in kernelized hashing for dialogue modeling by maximizing our MI LB.]</p>
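        <p>For concreteness, the per-bit terms of the bound in (3) can be estimated from hashcode samples roughly as sketched below, using a random forest for the proposal distribution as suggested above; the TC(Ct : Y; Mh) term is left out here, since it is computed separately by the latent-variable optimization discussed next. All names are illustrative.</p>
        <preformat>
# A rough sketch: Cp, Ct are 0/1 integer arrays of shape (n, H); assumes
# both bit values occur in each column, so that clf.classes_ == [0, 1].
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def bit_entropy(bits):
    # entropy (in nats) of a single Bernoulli hashcode bit
    p = np.clip(bits.mean(), 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def mi_lb_terms(Cp, Ct):
    # sum over bits j of H(Ct(j)) + E[log q(Ct(j)|Cp)]
    total = 0.0
    for j in range(Ct.shape[1]):
        clf = RandomForestClassifier(n_estimators=100).fit(Cp, Ct[:, j])
        proba = np.clip(clf.predict_proba(Cp), 1e-12, None)
        log_q = np.log(proba[np.arange(len(Ct)), Ct[:, j]])
        total += bit_entropy(Ct[:, j]) + log_q.mean()
    return total  # subtracting TC(Ct : Y) then yields the bound in (3)
        </preformat>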
        <p>An interesting aspect of the quantity TC(Ct : Y; Mh) is
that one can compute it efficiently for an optimized Y that
explains the maximum possible Total Correlation present in Ct;
see [Ver Steeg and Galstyan, 2014] for more details on the
optimization. For practical purposes, the dimension of the latent
representation Y can be kept much smaller than the
dimension of the hashcodes, i.e. |Y| ≪ |Cp| for |Cp| ≫ 1.</p>
        <p>As we observed for the mutual information criterion above,
the different terms in the mutual information
lower bound also have varying roles in the optimization; for
instance, the first two terms in (3) contribute to improving the
quality of hashcodes as representations, i.e. maximizing the
entropy of each hashcode bit while discouraging redundancies
between the bits, and the last term, of conditional (cross-)entropies,
improves the inference of hashcode bits individually.</p>
        <p>We argue for using the proposed MI lower bound not
only for optimizing hashing for dialogue modeling, but also as
an evaluation metric quantifying the representation quality as
well as the inference quality of hashcodes; besides, it serves to
quantify what is mutually relevant between the responses of two
dialogue agents.</p>
        <p>Further, for evaluating representation quality of hashcodes
separately, in Theorem 2 below, we present a novel tight
upper bound for joint entropy of hashcodes.</p>
        <p>Theorem 2 (Upper Bound for Entropy). Joint entropy of
hashcodes, H(Ct; Mh), is upper bounded as,
H(Ct; Mh) ≤ Σj H(Ct(j); Mh) − TC(Ct : Y; Mh).   (4)</p>
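        <p>A minimal sketch of computing the upper bound (4) from hashcode samples is given below, assuming the TC(Ct : Y; Mh) value has already been estimated by a latent-factor method such as CorEx [Ver Steeg and Galstyan, 2014] (whose API is not assumed here).</p>
        <preformat>
import numpy as np

def entropy_upper_bound(Ct, tc_ct_y):
    # Ct: (n, H) 0/1 array; tc_ct_y: precomputed TC(Ct : Y) estimate
    p = np.clip(Ct.mean(axis=0), 1e-12, 1 - 1e-12)   # per-bit P(bit = 1)
    bit_entropies = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return bit_entropies.sum() - tc_ct_y             # sum_j H(Ct(j)) - TC(Ct : Y)
        </preformat>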
        <p>Note the similarities between the expressions for the proposed
mutual information lower bound and the joint entropy
upper bound; the first two terms match between the two
expressions.</p>
        <p>For TC(Ct|Y; Mh) = 0, i.e. when a latent
representation Y is learned which explains all the total correlations
in Ct, the upper bound becomes equal to the entropy term;
practically, for the case of hashcodes, learning such a
representation should not be too difficult, so the bound should be
very tight. This is quite interesting, as we are able to judge
the representation quality of hashcodes, which we proposed to
quantify via the entropy of hashcodes above, through
the tight upper bound in Theorem 2, which is much easier,
cheaper, and more accurate to compute than the entropy term itself.
Besides the mutual information lower bound, for more detailed
analysis, we propose to use the upper bound as an evaluation
metric for the dialogue problem when modeling via hashing
functions.</p>
        <p>The proposed information theoretic bounds are generically
applicable for any high dimensional variables. For instance,
our MI LB and Entropy UB are generically applicable for
other paragraph-embedding like representations [Wieting et
al., 2015; Arora et al., 2016], though more efficient to
compute on binary ones. See the Appendix for derivations of
the bounds.</p>
        <p>Previously, variational lower bounds on the mutual
information criterion have been proposed [Barber and Agakov,
2003; Chalk et al., 2016; Gao et al., 2016; Chen et al., 2016;
Alemi et al., 2017; Garg et al., 2018a] for settings where one
of the two variables is fixed (say class labels). This simplifies
the derivation significantly, compared to our setting.</p>
        <p>Next, we discuss details on how to optimize kernel- or
neural-hashing (Alg. 1) by maximizing our MI LB.
</p>
        <p>
In the above, we formalized an objective function, i.e. a lower
bound on the mutual information criterion, to optimize all the
hash functions jointly for dialogue modeling, while keeping
each hash function randomized so as to preserve the
property of locality sensitive hashing. In this section, we discuss
optimization details specific to the kernel- or neural-hashing
models as described in Sec. 3.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Optimizing Kernel-Hashing Models for Dialogue Modeling</title>
        <p>For kernel-hashing in Alg. 1, we can optimize the reference set,
SR, or the kernel parameters. For the selection of
structures in SR, we use a greedy algorithm maximizing the
proposed mutual information lower bound (in Theorem 1);
see the pseudo code in Alg. 3. We initialize a reference set
of small size I ≪ M, by randomly selecting responses
from the training set of patient/therapist responses, i.e. S =
{S1p, …, SNp, S1t, …, SNt}; though, as noted before, the
superset S for the random selection can be any set of
sentences/paragraphs, not necessarily coming from a dataset of
patient/therapist responses. First, each element in the initial
reference set is optimized greedily, and then more elements
are added one by one until the reference set size grows to
M. When optimizing each element in the set, for
computing the MI lower bound, we sample β response
pairs from the training set of patient/therapist response pairs,
{(Sip, Sit)}i=1,…,N. For computational efficiency, we adopt the
idea of sampling for the candidate set as well: in each greedy
optimization step, we sample a subset of candidates of size γ
from the set S.</p>
        <p>The computation cost in the optimization is dominated by
the number of convolution kernel similarities, i.e. O(β(M² +
γM)). In practice, we can keep low values of β as well as
γ; in our experiments, we use β = 1000, γ = 100, and vary
the value of M from 30 up to 300. A similar procedure can be
used to optimize the kernel parameters.</p>
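        <p>Schematically, the greedy procedure described above might look as follows, where mi_lb_for is a hypothetical stand-in for rebuilding the kernel-hashing model on a candidate reference set and evaluating the MI LB on the β sampled response pairs; all names here are illustrative.</p>
        <preformat>
import random

def optimize_reference_set(S, mi_lb_for, M=100, I=15, gamma=100, seed=0):
    rnd = random.Random(seed)
    S_R = rnd.sample(S, I)          # random initialization of the reference set
    for i in range(I):              # first, greedily refine each initial element
        cands = rnd.sample(S, gamma)
        S_R[i] = max(cands, key=lambda c: mi_lb_for(S_R[:i] + [c] + S_R[i + 1:]))
    for _ in range(M - I):          # then add elements one by one up to size M
        cands = rnd.sample(S, gamma)
        S_R.append(max(cands, key=lambda c: mi_lb_for(S_R + [c])))
    return S_R
        </preformat>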
      </sec>
      <sec id="sec-4-7">
        <title>Optimizing Neural Network Architecture in Neural-Hashing Models</title>
        <p>As mentioned previously, if using neural language models for
hashing in Alg. 1, we can optimize the number of layers and
the units in each layer by maximizing the proposed MI LB;
see the pseudo code in Alg. 2.</p>
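        <p>Schematically, the greedy layer-wise search of Alg. 2 can be sketched as below, with compute_hashcodes and mi_lower_bound passed in as callables; both are hypothetical stand-ins for the hashing model of Alg. 1 and the bound of Theorem 1.</p>
        <preformat>
def optimize_architecture(Sp, St, compute_hashcodes, mi_lower_bound,
                          L=4, units=(4, 8, 16, 32, 64, None)):
    # returns the list of layer sizes chosen greedily, bottom-up
    layers = []
    for j in range(L):
        candidates = [u for u in units if not (j == 0 and u is None)]
        scores = {}
        for u in candidates:
            arch = layers + [u] if u is not None else layers
            Cp = compute_hashcodes(Sp, arch)
            Ct = compute_hashcodes(St, arch)
            scores[u] = mi_lower_bound(Cp, Ct)
        best = max(scores, key=scores.get)
        if best is None:
            break                   # adding another layer does not help
        layers.append(best)
    return layers
        </preformat>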
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <p>In this section, we discuss our experimental simulations,
primarily on a dataset of textual transcriptions of the audio
recordings of 400 depression therapy sessions, with each
session conducted by a human therapist with a real patient,
overall involving hundreds of patients and therapists, along with
an additional dataset of interview transcripts between Larry
King (host) and his guests⁴. For the purposes of evaluations
in this paper, we put all the pairs of patient/therapist responses
from different sessions together in a single set, {(Sip, Sit)}i=1,…,N,
of size N = 42,000. We perform a random 90%-10% split of the
dataset into a training set (38,000 response pairs) and a test
set (4,200 response pairs), respectively.⁵ Similarly, for the
second dataset, we put together all pairs of host/guest
responses from 75 sessions in a single set of size 8,200. Further,
we perform 10 trials for obtaining the statistics on evaluation
metrics in each experiment, with 95% of the response pairs sampled
from the training subset, and the same percentage of pairs from
the test subset.</p>
      <sec id="sec-5-1">
        <title>Hashing Settings</title>
        <p>The number of hashcode bits is one hundred (H = 100) for
both patient and therapist responses. To obtain hashcodes of
responses, for each sentence in a textual response of a
patient/therapist, we obtain part-of-speech (POS) tags as additional
features.</p>
        <p>In case of kernel-hashing, we use subsequence
kernels [Mooney and Bunescu, 2005] for computing the similarity
between two responses⁶, while the similarity between a pair of
words is computed using WordNet if their POS tags match.</p>
        <p>⁴http://transcripts.cnn.com/TRANSCRIPTS/lkl.html</p>
        <p>⁵We use random seed value zero for the numpy.random package.</p>
        <p>[Table: further example pairs of patient and therapist responses from the therapy sessions.]</p>
        <p>For the size of the reference set SR, we try the values 30, 100,
300 (α = 10). The reference set is initialized either randomly as a
subset of all the responses in the training set (Random
Selection), or the sub-selection is optimized with Alg. 3 (MI LB
Optimal), with configuration β = 1000, γ = 100, I = 15.</p>
        <p>See Tab. 3. We explore two different kernel locality
sensitive hashing techniques [Joly and Buisson, 2011; Garg et al.,
2018b], referred to as RMM⁷ and RkNN respectively, while
considering the latter as our primary choice for detailed
analysis.</p>
        <p>For the case of neural-hashing, we use all the responses
from the training set as the reference set, of size 76,000; since
the reference set is very large, we use random subsets of
larger size for constructing hash functions, i.e. α = 50. We
use an LSTM model to construct each hash function, trained
using the Adam algorithm [Kingma and Ba, 2014], with
learning rate 1e-3, amsgrad=True, and l1, l2 regularization
coefficients of value 1e-4; the gradients are computed with
back propagation. We initialize a word in a response and its
POS tag with a randomly initialized word vector of size 30;
for a single time-step processing within the LSTM, the word
vectors of 10 adjacent words, along with their POS tags, are
appended into a vector of size 600; this is required to avoid
vanishing gradients, since patient responses can be up to
8,000 words long in the training dataset. For the H
LSTM models serving as neural-hash functions, the same neural
architecture, i.e., the same number of layers and units, is used in each
model. When optimizing the architecture of the LSTM
models with Alg. 2 by maximizing our proposed MI LB (MI LB
Optimal), we add layers one by one greedily, up to a maximum of
4 layers (L = 4, β = 1000), and try out different
possible numbers of units in each layer, i.e., 4, 8, 16, 32, 64.⁸</p>
        <p>⁶λ = 0.8; subsequences of length up to 8 are matched.</p>
        <p>⁷In this hashing technique, an SVM is learned for each hash
function. We use C = 0.1 in the SVM, keeping it highly regularized.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Random Forests for Inferring Therapist Hashcodes</title>
        <p>From the above model, we obtain hashcodes for patient and
therapist responses in the training and test subsets, serving
as ground truth for the task of inferring the hashcode of the
therapist response, given the hashcode of a patient response.
For inferring each bit of therapist code, we train an individual
Random Forest (RF) classifier containing 100 decision trees.
All of the hundred RF classifiers (since H = 100) share the
same features (i.e. the hashcodes of patient responses), though they are
trained independently of each other.</p>
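        <p>A minimal sketch of this per-bit inference setup, assuming the train-set hashcodes are available as 0/1 numpy arrays of shape (num_pairs, H):</p>
        <preformat>
from sklearn.ensemble import RandomForestClassifier

def train_bit_classifiers(Cp_train, Ct_train, n_trees=100):
    # one random forest per therapist hashcode bit, all sharing the
    # patient hashcode as the feature vector
    return [RandomForestClassifier(n_estimators=n_trees).fit(Cp_train, Ct_train[:, j])
            for j in range(Ct_train.shape[1])]

def infer_therapist_code(classifiers, cp):
    # cp: a single patient hashcode of shape (H,)
    return [clf.predict(cp[None, :])[0] for clf in classifiers]
        </preformat>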
      </sec>
      <sec id="sec-5-3">
        <title>Evaluation Results</title>
        <p>Evaluation metrics Our primary evaluation metric is our
proposed MI lower bound (MI LB) in Theorem 1. We also
use the proposed upper bound on the entropy of therapist
hashcodes in Theorem 2 to evaluate the quality of hashcodes
as representations of responses (Entropy UB). In addition,
we express the ratio of MI LB and Entropy UB as
Normalized MI LB (NMI LB). It is also interesting to analyze
MI LB for random pairings between patient and therapist
responses (denoted as "Shuffled"); the same applies for NMI LB.</p>
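        <p>Putting these together, the metrics can be sketched roughly as below, reusing the mi_lb_terms and entropy_upper_bound sketches above and treating the TC(Ct : Y; Mh) estimate as precomputed; the permutation implements the "Shuffled" baseline.</p>
        <preformat>
import numpy as np

def evaluation_metrics(Cp, Ct, tc_ct_y, seed=0):
    mi_lb = mi_lb_terms(Cp, Ct) - tc_ct_y          # MI LB, Theorem 1
    ent_ub = entropy_upper_bound(Ct, tc_ct_y)      # Entropy UB, Theorem 2
    rng = np.random.default_rng(seed)
    shuffled = mi_lb_terms(Cp[rng.permutation(len(Cp))], Ct) - tc_ct_y
    return {"MI LB": mi_lb, "Entropy UB": ent_ub,
            "NMI LB": mi_lb / ent_ub, "Shuffled MI LB": shuffled}
        </preformat>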
        <p>Interpreting the accuracy metric (Accuracy), i.e. the
accuracy of inferring each hashcode bit using the RF
classifiers, requires a more detailed discussion. Since the class
labels (therapist hashcodes) are not fixed, the accuracy is
inversely correlated with the Entropy UB metric. We also
obtain the baseline accuracy (Baseline), using a trivial
classifier that always chooses the most-frequent class label. While
analyzing the absolute numbers of accuracy is not
meaningful here, we can see the relative improvement of accuracy
w.r.t. the accuracy of the baseline dummy classifier. The mean
and standard deviation statistics for each metric are computed
over 10 runs of the experiment, as mentioned above; in the
case of the accuracy metric, the statistics are computed over
all (100) hashcode bits and over 10 trials. For each of the
metrics, higher values mean better performance. In addition, by
subtracting MI LB from Entropy UB, we obtain the inference
quality.</p>
        <p>⁸We keep the number of units small, since α is small.</p>
        <p>Table 4: Examples of a patient response, the ground-truth therapist response, and three textual response choices retrieved via the inferred therapist hashcode.
Patient: Run away and, and be by myself. | Therapist: Um-hum. | Retrieved: (i) Uh-huh. (ii) Oh. (iii) That doesn't pan out.
Patient: Yeah, we'll be in touch about all that stuff. | Therapist: Okay. So very nice to see you again. | Retrieved: (i) OK, I see. (ii) Kind of like that- (iii) Good, good. What's new anything?
Patient: Sensitive. Yeah. (inaudible) just like (inaudible) very uncomfortable. | Therapist: Uh-huh. | Retrieved: (i) What's going on with the moods? (ii) Big people don't feel this way like? (iii) Scares you how?
Patient: Yeah. It just feels moderately depressing. My head spins. | Therapist: Do you feel depressed? | Retrieved: (i) So eventually it will die out. (ii) So it may peter out later. (iii) We can stop for now.
Patient: Well, am I kidding myself when I'm saying nothing's bothering me? So that bothers me. | Therapist: When you don't - whether you can really trust your own feelings. | Retrieved: (i) To be at - all of us. (ii) I'm sure. (iii) That sounds exhausting, yeah.
Patient: (laughs) Um, yeah. (laughs) | Therapist: Forever. | Retrieved: (i) OK, I see. (ii) Okay, I understand. (iii) All right, take it.
Patient: But it's like that's what I've been doing. And I don't, I can't scrape up the money to go back to school. | Therapist: Yeah. | Retrieved: (i) Um-hum. (ii) Mm-hmm. (iii) Hmmm.</p>
      </sec>
      <sec id="sec-5-4">
        <title>Depression Dataset Results</title>
        <p>We present the results for our primary evaluation on the
depression dataset in Tab. 2. Overall, our MI LB optimization
based approach outperforms the alternatives in most of
the settings, for the MI LB as well as the Entropy UB metric
(best results shown in boldface). We first discuss in more
detail our results for kernel-based hashing.</p>
        <p>Kernel-Hashing Since the performance of a kernel-hashing
model varies with a random initialization of SR, we
perform multiple trials in that case, and report numbers for
the trial which gives the median value of MI LB. For
RkNN-kernel hashing, we vary the size of the reference set SR, i.e.
M ∈ {30, 100, 300}. For M = 30, the MI LB (-11.5) and
Entropy UB (5.7) values are low when selecting the reference set
elements randomly (Random Selection). On the other hand, if
the reference set is optimized (MI LB Optimal), MI LB (4.4)
and Entropy UB (30.1) have much higher values, as desired.
For M = 100, MI LB values increase significantly compared
to the values for M = 30; thus a high value of M is
advantageous. We see the benefits of optimizing the reference set (MI
LB: 13.6, Entropy UB: 26.2), compared to the Random
Selection case (MI LB: 9.8, Entropy UB: 19.9), for M = 100
as well. For a further increase in the size of the reference set,
M = 300, we obtain even higher values for MI LB (15.7),
though not a significant increase in Entropy UB (24.0),
especially when selecting the reference set randomly. In this
setting, the advantage of optimizing the reference set is
somewhat marginal. For robustness, it is always better to optimize
the set rather than rely on a random selection.</p>
        <p>For kernel-RMM, we use M = 100, obtaining results similar
to those with kernel-RkNN hashing. Here it is interesting to
see that the ratio of MI LB (11.4) w.r.t. its baseline of MI LB
for shuffled pairs (2.8) is significantly higher (4.1) than for other
models. This ratio should serve as another metric, derived
from MI LBs, to judge the quality of a dialogue model (a higher
value of the ratio is better).</p>
        <p>With an increase in M when selecting the reference set
randomly, we observe a decrease in baseline accuracy, since
the Entropy UB increases in value. Also, when
optimizing the reference set instead of Random Selection, Entropy
UB increases, and therefore the baseline accuracy decreases.
Overall, when the difference between Entropy UB and MI
LB (quantifying inference quality) is high, the positive gap
between the accuracy numbers and the baseline accuracy
numbers is more significant. Variance for accuracy is high
because for some bits in hashcodes, the distribution of 0s and
1s is imbalanced (say, 99%-1%), making the inference very
challenging for those bits.</p>
        <p>Herein, it is also interesting to observe MI LB values for
the “Shuffled” setting (presented in brackets), i.e.
computing MI LB for random pairings of patient/therapist responses.
This should serve as a good metric to analyze a hashing based
dialogue model as well. Since we are modeling responses
from all therapy sessions put together into a single set, one
should expect that a good model should have low value of MI
LB for the shuffled setting; though, it is natural to expect a
minimal value of mutual information even between randomly
paired responses of patient &amp; therapist, from different
sessions, due to the common theme of depression psychotherapy
across the sessions.</p>
      </sec>
      <sec id="sec-5-5">
        <title>Neural-Hashing</title>
        <p>For the optimization of the Neural-RLSTM hashing model,
Alg. 2 gives a simple LSTM architecture of [16, 64, 16, 8],
i.e. four layers, with 16 LSTM units in the first layer
(the bottom layer receiving the inputs), 64 LSTM units in the second
layer, and so on. For this model, we obtain relatively higher
values of MI LB (17.2), compared to the kernel-hashing
above. Besides the optimized architecture for the LSTM
models, we also tried manually built architectures; for instance,
using a simple LSTM model of a single layer with 10 units for
each hash function, we obtain an MI LB value of 6.7, which is
significantly lower compared to the optimized Neural-RLSTM
model (17.2), and to many of the kernel-hashing models above.</p>
        <p>The above results demonstrate that our approach of
optimizing hashing functions by maximizing our novel MI LB
leads to higher values of the evaluation metrics, as desired,
across both the kernel- and neural-hashing models. There
is a trade-off between kernel-hashing and neural-hashing. In
kernel-hashing, from the selected responses in the optimized
reference set, one can get insights on what patterns are
relevant for a dialogue; e.g., see Tab. 3. On the other hand, with
neural-hashing, one can take advantage of the vast variety of
neural language models available.</p>
        <p>Further, we can map an inferred hashcode to textual
responses using a repository of indexed sentences. See Tab. 4.</p>
      </sec>
      <sec id="sec-5-6">
        <title>Larry King Dataset Results</title>
        <p>Some selected results for the Larry King dataset are presented
in Tab. 5. Herein, it is interesting to see that we get very high
values of MI LB and Entropy UB with our neural hashing
approach (optimized architecture [8, 32]), in comparison to
our kernel hashing. Also, the mean accuracy with neural
hashing is significantly higher than the baseline accuracy number,
though the NMI LB values for the two hashing approaches are
relatively close.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>This paper introduces a novel approach to dialogue
modeling based on hash functions, using psychotherapy sessions
as a motivating domain. In our framework, responses from
both parties (e.g., patient and therapist) are represented by
the corresponding hashcodes, capturing certain text patterns.
Furthermore, we propose a novel lower bound on Mutual
Information in order to characterize the relevance of a
therapist’s response to the patient’s text, and vice versa. Moreover,
in order to characterize the general quality of hashcodes as
response representations, we propose a tight upper bound on
the joint entropy of hashcodes. We performed an empirical
evaluation of the proposed approach on a dataset containing
depression therapy sessions between real patients and
therapists. We optimized locality sensitive hashing models, based
on kernel functions or neural language models, by
maximizing the proposed MI lower bound as an objective function.
Our results consistently demonstrate superior performance of
the proposed approach over several alternatives, as measured
by several evaluation metrics.</p>
    </sec>
    <sec id="sec-7">
      <title>Derivations of the Information Theoretic Bounds</title>
      <p>Before the discussion of our novel lower bound, we
introduce the information-theoretic quantity called Total
Correlation (TC), which captures non-linear correlation among the
dimensions of a random variable X, i.e.,</p>
      <p>TC(X; Mh) = Σj H(X(j); Mh) − H(X; Mh);
TC(X : Y; Mh) = TC(X; Mh) − TC(X|Y; Mh).   (5)
Intuitively, (5) describes the amount of information within X
that can be explained by Y.</p>
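      <p>As a small numeric illustration of definition (5), total correlation can be estimated directly from empirical frequencies when the dimension is modest; this brute-force estimator is only illustrative and does not scale to H = 100 bits.</p>
      <preformat>
import numpy as np

def entropy_from_counts(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def total_correlation(X):
    # X: (n, d) array of 0/1 samples; TC(X) = sum_j H(X(j)) - H(X)
    marginals = sum(entropy_from_counts(np.bincount(X[:, j]))
                    for j in range(X.shape[1]))
    _, joint_counts = np.unique(X, axis=0, return_counts=True)
    return marginals - entropy_from_counts(joint_counts)
      </preformat>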
      <p>Along these lines, the mutual information quantity between
the hashcodes can be decomposed as in Lemma 1 below.
Lemma 1 (Mutual Information Decomposition). Mutual
Information between Ct and Cp is decomposed as follows:
I(Ct : Cp; Mh) = Σj I(Ct(j) : Cp; Mh) − TC(Ct : Cp; Mh).   (6)</p>
      <p>Looking at the first term on the RHS of (6), it is the mutual
information between a one-dimensional and a multi-dimensional
random variable.</p>
      <p>For these terms, since one of the variables is only 1-D, we
can use the existing technique of variational bounds for an
approximation [Gao et al., 2016], as in Lemma 2 below.
Lemma 2. Marginal mutual information for each bit in
therapist hashcodes, I(Ct(j) : Cp; Mh), is lower bounded as,
I(Ct(j) : Cp; Mh) ≥ H(Ct(j); Mh) + ⟨log q(Ct(j)|Cp; Mh)⟩p(Ct(j),Cp;Mh).   (7)</p>
      <p>
Herein, H(Ct(j); Mh) is easy to compute because Ct(j)
is one-dimensional. For each of the proposal distributions
q(Ct(j)|Cp; Mh), we propose to use a Random Forest (RF)
classifier [Garg et al., 2018a].</p>
      <p>In reference to the second term of the RHS in (6), it is
computationally intractable to compute the total correlation
expression TC(Ct : Cp; Mh), which denotes the total correlations
between bits of Ct explainable by Cp. So, we would also like
to obtain an upper bound on TC(Ct : Cp; Mh) which is cheap
to compute; that would give us a lower bound for the second
term in (6) because of the negative sign.</p>
      <p>Lemma 3. TC(Ct : Cp; Mh) can be upper bounded as:
TC(Ct : Cp; Mh) ≤ TC(Ct : Y; Mh) s.t. |Y| ≥ |Cp|,   (8)
wherein |·| denotes the dimensionality of a random variable.</p>
      <p>Although it is intractable to compute the original term
TC(Ct : Cp; Mh), it is possible to compute TC(Ct :
Y; Mh) for a latent variable representation Y of Ct that
maximally explains the Total Correlations in Ct.</p>
      <p>We can think of the computation of the upper bound as an
unsupervised learning problem. We propose to use an
existing algorithm, CorEx, for the unsupervised learning of latent
random variables representation Y [Ver Steeg and Galstyan,
2014].</p>
      <p>It is important to note some practical considerations about
the upper bound. In the case of a suboptimal solution to the
maximization of TC(Ct : Y; Mh) above, the optimized
quantity may not be an upper bound of TC(Ct : Cp; Mh), but
rather an approximation. Also, the upper bound would not
be tight if Cp doesn’t explain much of total correlations in
Ct. Further, for even more computation cost reductions
during the learning, the dimension of the latent representation Y
can be kept much smaller than the dimension of hashcodes,
i.e. |Y| ≪ |Cp| for |Cp| ≫ 1; this is because even a small
number of latent variables should explain most of the total
correlations for practical purposes as demonstrated in [Ver
Steeg and Galstyan, 2014], and observed in our experiments
on hashcodes as well.</p>
      <p>Combining (7) and (8) into (6), we get the lower bound in
(3) in Theorem 1.</p>
      <p>Along the same lines, we derive the tight upper bound on the joint
entropy of hashcodes in Theorem 2. From the definition of
Total Correlation above (5), we have the following,</p>
      <p>TC(Ct; Mh) = TC(Ct : Y; Mh) + TC(Ct|Y; Mh),
and therefore
H(Ct; Mh) = Σj H(Ct(j); Mh) − TC(Ct : Y; Mh) − TC(Ct|Y; Mh).
From this derived expression, we can simply obtain the upper
bound in (4), with TC(Ct|Y; Mh) as the corresponding gap.</p>
      <sec id="sec-8-1">
        <title>Previous Lower Bounds for Mutual Information</title>
        <p>Variational lower bounds on the mutual information criterion
have been proposed in the past [Barber and Agakov, 2003;
Chalk et al., 2016; Gao et al., 2016; Chen et al., 2016;
Alemi et al., 2017; Garg et al., 2018a]. Their lower bounds
work only when one of the variables is fixed, say if Ct were
fixed. In our objective, Ct is not only a functional of the
hashing model that we are learning, but also high dimensional.
Unless we have a lower bound for the entropy term H(Ct; Mh)
as well, which should be hard to obtain, we cannot use the
above-mentioned variational lower bounds for our problem
as such. Besides, it is also non-trivial to find an
appropriate proposal distribution q(Ct|Cp; Mh). Therefore, we adopt
a different approach for obtaining a novel lower bound on the
mutual information quantity, as described above.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Alemi et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Alemi</surname>
          </string-name>
          , Ian Fischer,
          <string-name>
            <given-names>Josh</given-names>
            <surname>Dillon</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Murphy</surname>
          </string-name>
          .
          <source>Deep variational information bottleneck</source>
          .
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Althoff et al.,
          <year>2016</year>
          ] Tim Althoff, Kevin Clark, and
          <string-name>
            <given-names>Jure</given-names>
            <surname>Leskovec</surname>
          </string-name>
          .
          <article-title>Large-scale analysis of counseling conversations: An application of natural language processing to mental health</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <volume>4</volume>
          :
          <fpage>463</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Arora et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Sanjeev</given-names>
            <surname>Arora</surname>
          </string-name>
          , Yingyu Liang, and
          <string-name>
            <given-names>Tengyu</given-names>
            <surname>Ma</surname>
          </string-name>
          .
          <article-title>A simple but tough-to-beat baseline for sentence embeddings</article-title>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Asghar et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Nabiha</given-names>
            <surname>Asghar</surname>
          </string-name>
          , Pascal Poupart, Xin Jiang, and
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Deep active learning for dialogue generation</article-title>
          .
          <source>In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)</source>
          , pages
          <fpage>78</fpage>
          -
          <lpage>83</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>[Barber and Agakov</source>
          , 2003]
          <string-name>
            <given-names>David</given-names>
            <surname>Barber</surname>
          </string-name>
          and
          <string-name>
            <given-names>Felix</given-names>
            <surname>Agakov</surname>
          </string-name>
          .
          <article-title>The IM algorithm: a variational approach to information maximization</article-title>
          .
          <source>In Proceedings of the 16th International Conference on Neural Information Processing Systems</source>
          , pages
          <fpage>201</fpage>
          -
          <lpage>208</lpage>
          . MIT Press,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>[Bartl and Spanakis</source>
          , 2017]
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Bartl</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gerasimos</given-names>
            <surname>Spanakis</surname>
          </string-name>
          .
          <article-title>A retrieval-based dialogue system utilizing utterance and context embeddings</article-title>
          .
          <source>arXiv preprint arXiv:1710.05780</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Chalk et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Chalk</surname>
          </string-name>
          , Olivier Marre, and
          <string-name>
            <given-names>Gasper</given-names>
            <surname>Tkacik</surname>
          </string-name>
          .
          <article-title>Relevant sparse codes with variational information bottleneck</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>1957</fpage>
          -
          <lpage>1965</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Chen et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Xi</given-names>
            <surname>Chen</surname>
          </string-name>
          , Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and
          <string-name>
            <given-names>Pieter</given-names>
            <surname>Abbeel</surname>
          </string-name>
          .
          <article-title>InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>2172</fpage>
          -
          <lpage>2180</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>[Collins and Duffy</source>
          , 2002]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Collins</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nigel</given-names>
            <surname>Duffy</surname>
          </string-name>
          .
          <article-title>Convolution kernels for natural language</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>625</fpage>
          -
          <lpage>632</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Fitzpatrick et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Kathleen Kara</given-names>
            <surname>Fitzpatrick</surname>
          </string-name>
          , Alison Darcy, and
          <string-name>
            <given-names>Molly</given-names>
            <surname>Vierhile</surname>
          </string-name>
          .
          <article-title>Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (woebot): A randomized controlled trial</article-title>
          .
          <source>JMIR Mental Health</source>
          ,
          <volume>4</volume>
          (
          <issue>2</issue>
          ):e19,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Gao et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Shuyang</given-names>
            <surname>Gao</surname>
          </string-name>
          , Greg Ver Steeg, and
          <string-name>
            <given-names>Aram</given-names>
            <surname>Galstyan</surname>
          </string-name>
          .
          <article-title>Variational information maximization for feature selection</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>487</fpage>
          -
          <lpage>495</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Garg et al., 2018a]
          <string-name>
            <given-names>Sahil</given-names>
            <surname>Garg</surname>
          </string-name>
          , Aram Galstyan, Greg Ver Steeg, Irina Rish, Guillermo Cecchi, and
          <string-name>
            <given-names>Shuyang</given-names>
            <surname>Gao</surname>
          </string-name>
          .
          <article-title>Efficient representation for natural language processing via kernelized hashcodes</article-title>
          .
          <source>arXiv preprint arXiv:1711.04044</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Garg et al., 2018b]
          <string-name>
            <given-names>Sahil</given-names>
            <surname>Garg</surname>
          </string-name>
          , Greg Ver Steeg, and
          <string-name>
            <given-names>Aram</given-names>
            <surname>Galstyan</surname>
          </string-name>
          .
          <article-title>Stochastic learning of nonstationary kernels for natural language modeling</article-title>
          .
          <source>arXiv preprint arXiv:1801.03911</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <source>[Haussler</source>
          , 1999]
          <string-name>
            <given-names>David</given-names>
            <surname>Haussler</surname>
          </string-name>
          .
          <article-title>Convolution kernels on discrete structures</article-title>
          .
          <source>Technical report</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [He et al.,
          <year>2017</year>
          ]
          <string-name>
            <surname>He</surname>
            <given-names>He</given-names>
          </string-name>
          , Anusha Balakrishnan, Mihail Eric, and
          <string-name>
            <given-names>Percy</given-names>
            <surname>Liang</surname>
          </string-name>
          .
          <article-title>Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings</article-title>
          .
          <source>arXiv preprint arXiv:1704.07130</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <source>[Joly and Buisson</source>
          , 2011]
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Joly</surname>
          </string-name>
          and
          <string-name>
            <given-names>Olivier</given-names>
            <surname>Buisson</surname>
          </string-name>
          .
          <article-title>Random maximum margin hashing</article-title>
          .
          <source>In Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <source>2011 IEEE Conference on</source>
          , pages
          <fpage>873</fpage>
          -
          <lpage>880</lpage>
          . IEEE,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>[Jurafsky and Martin</source>
          , 2014]
          <string-name>
            <given-names>D</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          and
          <string-name>
            <given-names>JH</given-names>
            <surname>Martin</surname>
          </string-name>
          .
          <article-title>Dialog systems and chatbots</article-title>
          .
          <source>Speech and language processing</source>
          ,
          <volume>3</volume>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>[Kingma and Ba</source>
          , 2014]
          <string-name>
            <given-names>Diederik P</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>[Komorowski and Trzcinski</source>
          , 2017]
          <string-name>
            <given-names>Michal</given-names>
            <surname>Komorowski</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tomasz</given-names>
            <surname>Trzcinski</surname>
          </string-name>
          .
          <article-title>Random binary trees for approximate nearest neighbour search in binary space</article-title>
          .
          <source>arXiv preprint arXiv:1708.02976</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [Kraskov et al.,
          <year>2004</year>
          ]
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Kraskov</surname>
          </string-name>
          , Harald Stögbauer, and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Grassberger</surname>
          </string-name>
          .
          <article-title>Estimating mutual information</article-title>
          .
          <source>Physical Review E</source>
          ,
          <volume>69</volume>
          :
          <fpage>066138</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [Kraskov et al.,
          <year>2005</year>
          ]
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Kraskov</surname>
          </string-name>
          , Harald Stögbauer, Ralph G Andrzejak, and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Grassberger</surname>
          </string-name>
          .
          <article-title>Hierarchical clustering using mutual information</article-title>
          .
          <source>EPL (Europhysics Letters)</source>
          ,
          <volume>70</volume>
          (
          <issue>2</issue>
          ):
          <fpage>278</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <source>[Kulis and Grauman</source>
          , 2009]
          <string-name>
            <given-names>Brian</given-names>
            <surname>Kulis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kristen</given-names>
            <surname>Grauman</surname>
          </string-name>
          .
          <article-title>Kernelized locality-sensitive hashing for scalable image search</article-title>
          .
          <source>In Computer Vision, 2009 IEEE 12th International Conference on</source>
          , pages
          <fpage>2130</fpage>
          -
          <lpage>2137</lpage>
          . IEEE,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <source>[Kulis and Grauman</source>
          , 2012]
          <string-name>
            <given-names>Brian</given-names>
            <surname>Kulis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kristen</given-names>
            <surname>Grauman</surname>
          </string-name>
          .
          <article-title>Kernelized locality-sensitive hashing</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>34</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1092</fpage>
          -
          <lpage>1104</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [Lewinsohn et al.,
          <year>1990</year>
          ]
          <string-name>
            <given-names>Peter M</given-names>
            <surname>Lewinsohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gregory N</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hyman</given-names>
            <surname>Hops</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Judy</given-names>
            <surname>Andrews</surname>
          </string-name>
          .
          <article-title>Cognitive-behavioral treatment for depressed adolescents</article-title>
          .
          <source>Behavior Therapy</source>
          ,
          <volume>21</volume>
          (
          <issue>4</issue>
          ):
          <fpage>385</fpage>
          -
          <lpage>401</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [Lewis et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Mike</given-names>
            <surname>Lewis</surname>
          </string-name>
          , Denis Yarats, Yann N Dauphin,
          Devi Parikh, and
          <string-name>
            <given-names>Dhruv</given-names>
            <surname>Batra</surname>
          </string-name>
          .
          <article-title>Deal or no deal? end-to-end learning for negotiation dialogues</article-title>
          .
          <source>arXiv preprint arXiv:1706.05125</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <source>[Li and Jurafsky</source>
          , 2016]
          <string-name>
            <given-names>Jiwei</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          .
          <article-title>Neural net models for open-domain discourse coherence</article-title>
          .
          <source>arXiv preprint arXiv:1606.01545</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [Li et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Jiwei</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michel</given-names>
            <surname>Galley</surname>
          </string-name>
          , Chris Brockett,
          <string-name>
            <given-names>Jianfeng</given-names>
            <surname>Gao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Bill</given-names>
            <surname>Dolan</surname>
          </string-name>
          .
          <article-title>A diversity-promoting objective function for neural conversation models</article-title>
          .
          <source>arXiv preprint arXiv:1510.03055</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [Li et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Jiwei</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Will</given-names>
            <surname>Monroe</surname>
          </string-name>
          , Tianlin Shi,
          <string-name>
            <given-names>Alan</given-names>
            <surname>Ritter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          .
          <article-title>Adversarial learning for neural dialogue generation</article-title>
          .
          <source>arXiv preprint arXiv:1701.06547</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [Liu et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Chia-Wei</given-names>
            <surname>Liu</surname>
          </string-name>
          , Ryan Lowe, Iulian V Serban,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Noseworthy</surname>
          </string-name>
          , Laurent Charlin, and
          <string-name>
            <given-names>Joelle</given-names>
            <surname>Pineau</surname>
          </string-name>
          .
          <article-title>How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation</article-title>
          .
          <source>arXiv preprint arXiv:1603.08023</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [Lowe et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Ryan</given-names>
            <surname>Lowe</surname>
          </string-name>
          , Michael Noseworthy, Iulian V Serban,
          Nicolas Angelard-Gontier,
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Joelle</given-names>
            <surname>Pineau</surname>
          </string-name>
          .
          <article-title>Towards an automatic turing test: Learning to evaluate dialogue responses</article-title>
          .
          <source>arXiv preprint arXiv:1708.07149</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <source>[Mooney and Bunescu</source>
          , 2005]
          <string-name>
            <given-names>Raymond J</given-names>
            <surname>Mooney</surname>
          </string-name>
          and
          <string-name>
            <given-names>Razvan C</given-names>
            <surname>Bunescu</surname>
          </string-name>
          .
          <article-title>Subsequence kernels for relation extraction</article-title>
          .
          <source>In Proc. of NIPS</source>
          , pages
          <fpage>171</fpage>
          -
          <lpage>178</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [Norouzi et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Mohammad</given-names>
            <surname>Norouzi</surname>
          </string-name>
          , Ali Punjani, and
          <string-name>
            <given-names>David J</given-names>
            <surname>Fleet</surname>
          </string-name>
          .
          <article-title>Fast exact search in hamming space with multi-index hashing</article-title>
          .
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          ,
          <volume>36</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1107</fpage>
          -
          <lpage>1119</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [Papineni et al.,
          <year>2002</year>
          ]
          <string-name>
            <given-names>Kishore</given-names>
            <surname>Papineni</surname>
          </string-name>
          , Salim Roukos, Todd Ward, and
          <string-name>
            <given-names>Wei-Jing</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          .
          <source>In Proceedings of the 40th annual meeting on association for computational linguistics</source>
          , pages
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . Association for Computational Linguistics,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [Serban et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Iulian Vlad</given-names>
            <surname>Serban</surname>
          </string-name>
          , Ryan Lowe,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Henderson</surname>
          </string-name>
          , Laurent Charlin, and
          <string-name>
            <given-names>Joelle</given-names>
            <surname>Pineau</surname>
          </string-name>
          .
          <article-title>A survey of available corpora for building data-driven dialogue systems</article-title>
          .
          <source>arXiv preprint arXiv:1512.05742</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [Serban et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Iulian Vlad</given-names>
            <surname>Serban</surname>
          </string-name>
          , Ryan Lowe, Laurent Charlin, and
          <string-name>
            <given-names>Joelle</given-names>
            <surname>Pineau</surname>
          </string-name>
          .
          <article-title>Generative deep neural networks for dialogue: A short review</article-title>
          .
          <source>arXiv preprint arXiv:1611.06216</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [Serban et al., 2017a]
          <string-name>
            <given-names>Iulian Vlad</given-names>
            <surname>Serban</surname>
          </string-name>
          , Tim Klinger, Gerald Tesauro, Kartik Talamadupula, Bowen Zhou, Yoshua Bengio, and
          <string-name>
            <given-names>Aaron C</given-names>
            <surname>Courville</surname>
          </string-name>
          .
          <article-title>Multiresolution recurrent neural networks: An application to dialogue response generation</article-title>
          .
          <source>In AAAI</source>
          , pages
          <fpage>3288</fpage>
          -
          <lpage>3294</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [Serban et al., 2017b]
          <string-name>
            <given-names>Iulian Vlad</given-names>
            <surname>Serban</surname>
          </string-name>
          , Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C Courville, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>A hierarchical latent variable encoder-decoder model for generating dialogues</article-title>
          .
          <source>In AAAI</source>
          , pages
          <fpage>3295</fpage>
          -
          <lpage>3301</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [Shao et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Yuanlong</given-names>
            <surname>Shao</surname>
          </string-name>
          , Stephan Gouws, Denny Britz, Anna Goldie, Brian Strope, and
          <string-name>
            <given-names>Ray</given-names>
            <surname>Kurzweil</surname>
          </string-name>
          .
          <article-title>Generating high-quality and informative conversation responses with sequence-to-sequence models</article-title>
          .
          <source>In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>2200</fpage>
          -
          <lpage>2209</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <source>[Studený and Vejnarová</source>
          ,
          <year>1998</year>
          ]
          <string-name>
            <given-names>Milan</given-names>
            <surname>Studený</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jirina</given-names>
            <surname>Vejnarová</surname>
          </string-name>
          .
          <article-title>The multiinformation function as a tool for measuring stochastic dependence</article-title>
          .
          <source>In Learning in graphical models</source>
          , pages
          <fpage>261</fpage>
          -
          <lpage>297</lpage>
          . Springer,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <source>[Ver Steeg and Galstyan</source>
          , 2014]
          <string-name>
            <given-names>Greg</given-names>
            <surname>Ver Steeg</surname>
          </string-name>
          and
          <string-name>
            <given-names>Aram</given-names>
            <surname>Galstyan</surname>
          </string-name>
          .
          <article-title>Discovering structure in high-dimensional data through correlation explanation</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>27</volume>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [Wang et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Jingdong</given-names>
            <surname>Wang</surname>
          </string-name>
          , Heng Tao Shen,
          <string-name>
            <given-names>Jingkuan</given-names>
            <surname>Song</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jianqiu</given-names>
            <surname>Ji</surname>
          </string-name>
          .
          <article-title>Hashing for similarity search: A survey</article-title>
          .
          <source>arXiv preprint arXiv:1408.2927</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [Wang et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Jingdong</given-names>
            <surname>Wang</surname>
          </string-name>
          , Ting Zhang, Nicu Sebe, Heng Tao Shen, et al.
          <article-title>A survey on learning to hash</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <source>[Watanabe</source>
          , 1960]
          <string-name>
            <given-names>Satosi</given-names>
            <surname>Watanabe</surname>
          </string-name>
          .
          <article-title>Information theoretical analysis of multivariate correlation</article-title>
          .
          <source>IBM Journal of research and development</source>
          ,
          <volume>4</volume>
          (
          <issue>1</issue>
          ):
          <fpage>66</fpage>
          -
          <lpage>82</lpage>
          ,
          <year>1960</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [Wen et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Tsung-Hsien</given-names>
            <surname>Wen</surname>
          </string-name>
          , David Vandyke,
          <string-name>
            <given-names>Nikola</given-names>
            <surname>Mrksic</surname>
          </string-name>
          , Milica Gasic,
          <string-name>
            <given-names>Lina M</given-names>
            <surname>Rojas-Barahona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pei-Hao</given-names>
            <surname>Su</surname>
          </string-name>
          , Stefan Ultes, and
          <string-name>
            <given-names>Steve</given-names>
            <surname>Young</surname>
          </string-name>
          .
          <article-title>A network-based end-to-end trainable task-oriented dialogue system</article-title>
          .
          <source>arXiv preprint arXiv:1604.04562</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [Wieting et al.,
          <year>2015</year>
          ] John Wieting, Mohit Bansal, Kevin Gimpel, and
          <string-name>
            <given-names>Karen</given-names>
            <surname>Livescu</surname>
          </string-name>
          .
          <article-title>Towards universal paraphrastic sentence embeddings</article-title>
          .
          <source>arXiv preprint arXiv:1511.08198</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [Wu et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Yu</given-names>
            <surname>Wu</surname>
          </string-name>
          , Wei Wu, Chen Xing,
          <string-name>
            <given-names>Ming</given-names>
            <surname>Zhou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zhoujun</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots</article-title>
          .
          <source>In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , volume
          <volume>1</volume>
          , pages
          <fpage>496</fpage>
          -
          <lpage>505</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          <source>[Zhai and Williams</source>
          , 2014]
          <string-name>
            <given-names>Ke</given-names>
            <surname>Zhai</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jason D</given-names>
            <surname>Williams</surname>
          </string-name>
          .
          <article-title>Discovering latent structure in task-oriented dialogues</article-title>
          .
          <source>In ACL (1)</source>
          , pages
          <fpage>36</fpage>
          -
          <lpage>46</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>