Efficient Purely Convolutional Text Encoding

Szymon Malik*, Adrian Lancucki*, Jan Chorowski
Institute of Computer Science, University of Wrocław
szymon.w.malik@gmail.com, {alan,jch}@cs.uni.wroc.pl
* Equal contribution

Abstract

In this work, we focus on a lightweight convolutional architecture that creates fixed-size vector embeddings of sentences. Such representations are useful for building NLP systems, including conversational agents. Our work derives from a recently proposed recursive convolutional architecture for auto-encoding text paragraphs at byte level. We propose alterations that significantly reduce training time and the number of parameters, and improve auto-encoding accuracy. Finally, we evaluate the representations created by our model on tasks from the SentEval benchmark suite, and show that it can serve as a better, yet fairly low-resource, alternative to popular bag-of-words embeddings.

1 Introduction

Modern conversational agents often make use of retrieval-based response generation modules [Ram et al., 2018], in which the response of the agent is retrieved from a curated database. The retrieval can be implemented as similarity matching in a vector space, in which natural language sentences are represented as fixed-size vectors. Cosine and Euclidean distances typically serve as similarity measures. Such approaches have been applied by participants of recent chatbot contests: The 2017 Alexa Prize [Pichl et al., 2018; Liu et al., 2017; Serban et al., 2017], and The 2017 NIPS Conversational Intelligence Challenge [Chorowski et al., 2018; Yusupov and Kuratov, 2017]. Retrieval-based modules are fast and predictable.
Most importantly, they enable soft matching between representations. Apart from this straightforward application in dialogue systems, sentence embeddings are applicable in downstream NLP tasks relevant to dialogue systems. Those include sentiment analysis [Pang and Lee, 2008], question answering [Weissenborn et al., 2017], censorship [Chorowski et al., 2018], or intent detection.

Due to the temporal nature of natural languages, recurrent neural networks gained popularity in NLP tasks. Active research of different architectures led to great advances and eventually a shift towards methods using the transformer architecture [Vaswani et al., 2017] or convolutional layers [Bai et al., 2018; van den Oord et al., 2017a; van den Oord et al., 2017b] among researchers and practitioners alike. In this work, we focus on a lightweight convolutional architecture that creates fixed-size representations of sentences.

Convolutional neural networks have the inherent ability to detect local structures in the data. In the context of conversational systems, their speed and memory efficiency eases deployment on mobile devices, allowing fast response retrieval and a better user experience. We analyze and build on the recently proposed Byte-Level Recursive Convolutional Auto-Encoder (BRCA) for text paragraphs [Zhang and LeCun, 2018], which is able to auto-encode text paragraphs into fixed-size vectors, reading in bytes with no additional preprocessing.

Based on our analysis, we are able to explain the behavior of this model, point out possible enhancements, and achieve auto-encoding accuracy improvements and an order of magnitude training speed-up, while cutting down the number of parameters by over 70%. We introduce a balanced padding scheme for input sequences and show that it significantly improves convergence and capacity of the model. As we find byte-level encoding unsuitable for embedding sentences, we demonstrate its applicability in processing sentences at word level. We train the encoder with supervision on the Stanford Natural Language Inference corpus [Bowman et al., 2015; Conneau et al., 2017] and investigate its performance on various transfer tasks to assess the quality of the produced embeddings.

The paper is structured as follows: in Section 2 we introduce some of the notions that appear in the paper. Section 3 discusses relevant work on sentence vector representations. Details of the architecture can be found in Section 4. Section 5 presents the analysis of the auto-encoder and motivations for our improvements. Section 6 demonstrates supervised training of the word-level sentence encoder, and evaluates it on tasks relevant to conversational systems. Section 7 concludes the paper.

2 Preliminaries

An information retrieval conversational agent selects a response from a fixed set. Let D = {(i_k, r_k)}_k be a set of conversational input-response pairs, and q be the current user's input. Two simple ways of retrieving a response r_k from available data are [Ritter et al., 2011; Chorowski et al., 2018]:
• return the r_k most similar to the user's input q,
• return the r_k for which i_k is most similar to q.
Utterances i_k, r_k, q may be represented (embedded) as real-valued vectors.
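To make these two retrieval rules concrete, the sketch below scores candidates by cosine similarity between embedded utterances. It is only an illustration of the scheme described above, not the paper's implementation; the `embed` argument stands for any sentence encoder, such as the one developed in this paper.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve(q, pairs, embed, match_on="response"):
    """Return the response r_k whose embedding (match_on='response') or whose
    input i_k's embedding (match_on='input') is closest to the embedded query q."""
    q_vec = embed(q)
    best_score, best_response = -np.inf, None
    for i_k, r_k in pairs:
        key = r_k if match_on == "response" else i_k
        score = cosine(q_vec, embed(key))
        if score > best_score:
            best_score, best_response = score, r_k
    return best_response
```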
Many NLP systems represent words as points in a continuous vector space using word embedding methods [Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2016]. They are calculated based on the co-occurrence of words in large corpora. The same methods were applied to obtain sentence embeddings only with partial success, due to the combinatorial explosion of all possible word combinations which make up a sentence. Instead, Recurrent Neural Networks (RNNs), autoregressive models that can process input sequences of arbitrary length, are thought to be a good method for handling variable-length textual data, with the Long Short-Term Memory network (LSTM) [Hochreiter and Schmidhuber, 1997] being the prime example.

Recently, RNNs have been reported to be successfully replaced by convolutional architectures [Bai et al., 2018; van den Oord et al., 2017a; van den Oord et al., 2017b]. Convolutional neural networks are traditionally associated with computer vision and image processing [Krizhevsky et al., 2012; Redmon et al., 2015]. They primarily consist of convolutional layers that apply multiple convolutions to the input, followed by pooling layers that are used for reducing the dimensionality of the hidden state. Convolutional networks are efficient during training and inference: they utilize few parameters and do not require sequential computations, making hardware parallelism easy to use. Due to their popularity in image processing, there are efficient implementations that scale well.

A residual connection [He et al., 2015] adds an unchanged input to the output of a layer or block of layers. During the forward pass it provides upper layers with an undistorted signal from the input and intermediate layers. During the backward pass it mitigates vanishing and exploding gradient problems [Hochreiter et al., 2001].

Batch Normalization (BN) [Ioffe and Szegedy, 2015] applies normalization to all activations in every minibatch. Typical operations used in neural networks are sensitive to changes in the range and magnitude of inputs. During training, the inputs to upper layers vary greatly due to changes in the weights of the model. Normalization of the signal in each layer has the potential to alleviate this problem. Both BN and residual connections enable faster convergence by helping with the forward and backward flow of information. They were also crucial in training our models.

3 Related Work

There are many methods for creating sentence embeddings, the simplest being averaging the word-embedding vectors of a given sentence [Joulin et al., 2016; Le and Mikolov, 2014]. SkipThought [Kiros et al., 2015] generalizes the idea of unsupervised learning of word2vec word embeddings [Mikolov et al., 2013]. It is implemented in the encoder-decoder setting using LSTM networks. Given a triplet of consecutive sentences (s_{i-1}, s_i, s_{i+1}), the encoder creates a fixed-size embedding vector of the sentence s_i, and the decoder tries to generate sentences s_{i-1} and s_{i+1} from this representation. SkipThought vectors have been shown to preserve syntactic and semantic properties [Kiros et al., 2015], so that their similarity is represented in the embedding space.

The InferSent model [Conneau et al., 2017] shows that training embedding systems with supervision on a natural language inference task may be superior to unsupervised training. Recently, better results were obtained by combining supervised learning on an auxiliary natural language inference corpus with learning to predict the correct response on a conversational data corpus [Yang et al., 2018].

4 Model Description

Our model builds on the Byte-Level Recursive Convolutional Auto-Encoder (BRCA) [Zhang and LeCun, 2018]. Both models use a symmetrical encoder and decoder. They encode a variable-length input sequence as a fixed-size latent representation by applying a variable number of convolve-and-pool stages (Figure 1). Unlike in autoregressive models, the recursion is not applied over the input sequence length, but over the depth of the model. As a result, each recursive step processes all sequence elements in parallel.

Figure 1: Structural comparison of the Byte-Level Recursive Convolutional Auto-Encoder (BRCA, panel a) and our model (panel b). Dark boxes indicate input padding. BRCA pads the input from the right to the nearest power of two; we pad the input evenly to a fixed-size vector. Our model does not have postfix/prefix groups with linear layers and uses Batch Normalization (BN) after every layer.

4.1 Encoder

Our encoder uses two operation groups from the BRCA: a prefix group and a recursive group. The prefix group consists of N temporal convolutional layers, and the recursive group consists of N temporal convolutional layers followed by a max-pool layer with kernel size 2.
For each sequence, the prefix group is applied only once, while the recursive group is applied multiple times, sharing weights between all applications. All convolutional layers have d = 256 channels, kernels of size 3, and are organized into residual blocks, with 2 layers per block, ReLU activations, residual connections, and Batch Normalization (see Section 5.5 for details).

The encoder of our model reads in text, by sentence or by paragraph, as a sequence of discrete tokens (e.g., bytes, characters, words). Each input token is embedded as a fixed-length d-dimensional vector. Unlike [Zhang and LeCun, 2018], where the input sequence is zero-padded to the nearest power of 2, we pad the input to a fixed length 2^K, distributing the characters evenly across the input. We motivate this decision by the finding that the model zero-padded to the nearest power of 2 does not generalize to sentences longer than those seen in the training data (see Section 5.2).

First, the prefix group is applied, which retains the dimensionality and the number of channels, i.e., sequence length and embedding size. Let d · 2^r be the dimensionality of the latent code output by the encoder. The encoder then applies the recursive group K − r times. Note that with r one may control the size of the latent vector (the level of compression). With the max-pooling, every application halves the length of the sequence. Weights are shared between applications. Finally, the encoder outputs a latent code of size d · 2^r. Unlike [Zhang and LeCun, 2018], we do not apply any linear layers after the recursions. Our experiments have shown that they slightly degrade the performance of the model, and constitute the majority of its parameters.

4.2 Decoder

The decoder acts in reverse. First, it applies a recursive group consisting of a convolutional layer which doubles the number of channels to 2d, followed by an expand transformation [Zhang and LeCun, 2018] and N − 1 convolutional layers. Then it applies a postfix group of N temporal convolutional layers. Similarly to the encoder, the layers are organized into residual blocks with ReLU activations and Batch Normalization, and have the same dimensionality and kernel size. We double the size of the input in the residual connection, which bypasses the first two convolutions of the recursive group, by stacking it with itself. We found it crucial for convergence speed to use only residual blocks in the network, also in the expand block.

The decoder applies its recursive group K − r times. Each application doubles the number of channels, while the expand transformation reorders and reshapes the data to effectively double the length of the sequence, so that the number of channels stays unchanged and equal to d. The postfix group processes a tensor of size 2^K × d and retains its dimensionality. The output is interpreted as 2^K probability distributions over possible output elements. Adding an output embedding layer, either tied with the input embedding layer or separate, slowed down training and did not improve the results. At the end, a Softmax layer is used to compute output probabilities over possible bytes. Note that the output probabilities are independent of one another conditioned on the input.
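The sketch below summarizes our reading of the encoder side of this architecture (Section 4.1) in PyTorch: a prefix group applied once, followed by a weight-shared recursive group with max-pooling applied K − r times. It is an illustration only; the exact layer counts, block layout, and other details of the released implementation may differ.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two temporal convolutions (kernel size 3) with BN, ReLU and a residual connection."""
    def __init__(self, d=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(d, d, kernel_size=3, padding=1), nn.BatchNorm1d(d), nn.ReLU(),
            nn.Conv1d(d, d, kernel_size=3, padding=1), nn.BatchNorm1d(d))
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)

class RecursiveConvEncoder(nn.Module):
    def __init__(self, vocab_size=256, d=256, n_blocks=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.prefix = nn.Sequential(*[ResBlock(d) for _ in range(n_blocks)])
        # A single recursive group; its weights are re-used at every depth.
        self.recursive = nn.Sequential(*[ResBlock(d) for _ in range(n_blocks)])
        self.pool = nn.MaxPool1d(kernel_size=2)

    def forward(self, tokens, r=2):
        # tokens: (batch, 2**K) integer ids, balance-padded to a fixed length.
        x = self.embed(tokens).transpose(1, 2)   # (batch, d, 2**K)
        x = self.prefix(x)
        while x.size(-1) > 2 ** r:               # the recursive group runs K - r times
            x = self.pool(self.recursive(x))     # each application halves the length
        return x                                 # latent code with d * 2**r values

# For example, 1024 balance-padded bytes with d = 256 and r = 2
# yield a latent code of 256 * 4 = 1024 floats:
# z = RecursiveConvEncoder()(torch.randint(0, 256, (8, 1024)))
```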
5 Model Analysis

In this section we justify our design choices through a series of experiments with the BRCA model and report their outcomes.¹
¹ Source code of our models is available at https://github.com/smalik169/recursive-convolutional-autoencoder

5.1 Data

In order to produce comparable results, we prepared an English Wikipedia dataset with a sentence length distribution similar to that of [Zhang and LeCun, 2018]. Namely, we took at random 11 million sentences from an English Wikipedia dump extracted with WikiExtractor², so that their length distribution would roughly match that of [Zhang and LeCun, 2018] (Table 1). In experiments with random data, we generate random strings of a-zA-Z0-9 ASCII characters.
² https://github.com/attardi/wikiextractor

Table 1: Lengths of paragraphs in the English Wikipedia dataset

Length        Percentage
4-63 B        35%
64-127 B      14%
128-255 B     20%
256-511 B     18%
512-1023 B    14%

5.2 Model Capacity

Natural language texts are highly compressible due to their low entropy, which results from the redundancy of the language [Levitin and Reingold, 1994]. In spite of this, the considered models struggle to auto-encode 1024-byte short paragraphs into 1024-float latent vectors, which are 4096 bytes given their sheer information content. The transition from discrete to continuous representation and the inherent inefficiency of the model are likely to account for some of this overhead. One can imagine an initialization of weights that, given the over-capacitated latent representation, would make the network perform identity for paragraphs up to 128 bytes long³.
³ When max-pooling is replaced by convolution with stride 2 and kernel size 2.

We confirmed those speculations experimentally, training models on paragraphs of random printable ASCII characters, namely random strings of a-zA-Z0-9 symbols (Table 2). The empirical capacity of our model is 128 bytes, which sheds light on the amount of overhead. The model has to be trained on paragraphs longer than 512 bytes in order to learn useful, compressing behavior given a 1024-float latent representation.

Table 2: Learning identity by training on random sequences of ASCII characters of different length. Accuracy is presented for the BRCA (N=8) model.

Training Lengths   Test Length   Accuracy
4 – 128            128           99.81%
4 – 512            128           60.79%
                   256           22.99%
                   512            9.81%
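The synthetic paragraphs used in these capacity experiments are essentially incompressible random strings; a sketch of how such data can be generated (our assumption of the sampling procedure, which the released code may implement differently) is:

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits  # a-zA-Z0-9, as used above

def random_paragraph(min_len=4, max_len=128):
    """Sample a random ASCII paragraph with length drawn uniformly from [min_len, max_len]."""
    n = random.randint(min_len, max_len)
    return "".join(random.choice(ALPHABET) for _ in range(n))

# e.g. a training set for the identity task on lengths 4-128
train_paragraphs = [random_paragraph(4, 128) for _ in range(100_000)]
```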
5.3 Generalization to Longer Sequences

Auto-encoding RNN models such as the LSTM are known to deteriorate gradually with longer sequences [Cho et al., 2014; Chorowski et al., 2015]. We trained a BRCA model (N = 2) and an LSTM encoder-decoder network with hidden size 256. Both models were trained on sentences of length up to 128 bytes and evaluated on unseen data. The LSTM model did not perfectly learn the identity function, even though it was solving an easier task of predicting the character given the correct prefix. However, the LSTM model generalized much better on longer sequences, where the performance of BRCA deteriorated rapidly (Table 3).

Table 3: Comparison of the ability of BRCA and an LSTM encoder-decoder to learn an identity function and generalize to unseen data. Values represent byte-level decoding accuracy. Note that the LSTM decoder has the advantage of always being primed with the correct prefix sequence.

Lengths (bytes)   BRCA (N=2)   LSTM-LSTM
9-16              97.06%       91.17%
17-32             97.96%       90.20%
33-64             97.45%       91.72%
65-128            83.56%       86.34%
129-256           11.66%       72.88%
257-512            8.05%       58.80%

5.4 Balanced Padding of Input Sequences

We found BRCA difficult to train. The default hyperparameters given by the authors [Zhang and LeCun, 2018] are single-sample batches, SGD with momentum 0.9, and a small learning rate of 0.01 with 100 epochs of training. In our preliminary experiments, increasing the batch size by batching paragraphs of the same length improved convergence on datasets with short sentences (mostly up to 256 bytes long), but otherwise deteriorated results on the Wikipedia dataset, where roughly 50% of paragraphs are longer than 256 bytes. We suspect that the difficulty lies in the difference of the underlying tasks: long paragraphs require compressive behavior, while short ones merely require learning the identity function. Updating network parameters towards one task hinders the performance on the other, hence the necessity for careful training.

In order to blend in both tasks, we opted for padding input sequences into fixed-length vectors. We find it sensible to fix the maximum length of the input sentence, since the model does not generalize to unseen lengths anyway. Variable length of input in BRCA does save computations; however, we found fixing the input size to greatly improve training time, despite the overhead.

In order to make the tasks more similar, we propose balanced padding of the inputs (Figure 2). Instead of padding from the right up to 2^K bytes, we pad to the nearest power of 2 and distribute the remaining padding equally in between the bytes. We hypothesized that it could free the convolutional layers from the burden of propagating the signal from left to right in order to fill the whole latent vector, as would be the case, e.g., when processing a 64-byte paragraph padded with 960 empty tokens from the right to form a 1024-byte input. Empirically, this trades additional computations for better convergence characteristics.

Figure 2: Unbalanced and balanced padding of an input sequence to a fixed-length sequence: (a) unbalanced padding to a sequence of length 8, (b) unbalanced padding to length 16, (c) balanced padding to length 8, (d) balanced padding to length 16. Grey boxes are (zero) padding, white boxes are input embedding vectors.
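A sketch of balanced padding as we read it from Figure 2: the pad symbols are spread as evenly as possible around and between the input tokens instead of being appended on the right. The exact placement of leftover pad symbols is our assumption and may differ from the released code.

```python
def balanced_pad(tokens, target_len, pad_id=0):
    """Distribute (target_len - len(tokens)) pad symbols evenly between the tokens."""
    assert len(tokens) <= target_len
    n_pad = target_len - len(tokens)
    n_gaps = len(tokens) + 1          # gaps before, between and after the tokens
    base, extra = divmod(n_pad, n_gaps)
    padded = []
    for i in range(n_gaps):
        # earlier gaps absorb the remainder, one extra pad each
        padded.extend([pad_id] * (base + (1 if i < extra else 0)))
        if i < len(tokens):
            padded.append(tokens[i])
    return padded

# balanced_pad([1, 2, 3], 8) -> [0, 0, 1, 0, 2, 0, 3, 0]
```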
5.5 Batch Normalization

Fixed-length, balance-padded inputs allow easy mixing of paragraphs of different lengths within a batch, in consequence allowing raising the batch size, applying Batch Normalization, and raising the learning rate. This enables a significant speed-up in convergence and better auto-encoding accuracy (see Section 5.6). However, the statistics collected by BN layers differ during each of the K − r recursive steps, even though the weights of the convolutions in the recursive layers are shared. This breaks auto-encoding during inference time, when BN layers have a fixed mean and standard deviation collected over a large dataset. We propose to alleviate this issue by either: a) collecting separate statistics for each recursive application and each input length separately, or b) placing a paragraph inside a batch of data drawn from the training corpus during inference and calculating the mean and the standard deviation on this batch. We also experimented with instance normalization [Ulyanov et al., 2016], which performs the normalization of the features of a single instance, rather than of a whole minibatch. We have found that instance normalization improved greatly upon the baseline model with no normalization, but performed worse than batch normalization.

BRCA has been introduced with linear layers in the postfix/prefix groups of the encoder/decoder. In our experiments, removing those layers from the vanilla BRCA lowered accuracy by a few percentage points. Conversely, our model benefits from not having linear layers. We observed faster convergence and better accuracy without them, while reducing the number of parameters from 23.4 million to 6.67 million.
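Option (a) above can be realized by pairing the shared convolution weights with a separate set of BatchNorm modules for each recursion depth. The sketch below reflects our interpretation (indexing the statistics by depth only; a second index over input lengths could be added the same way) and is not necessarily how the released code handles it.

```python
import torch.nn as nn

class SharedConvPerDepthBN(nn.Module):
    """Convolution weights shared across recursive steps, with per-depth BN running statistics."""
    def __init__(self, channels=256, max_depth=10):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)  # shared weights
        self.bns = nn.ModuleList([nn.BatchNorm1d(channels) for _ in range(max_depth)])
        self.relu = nn.ReLU()

    def forward(self, x, depth):
        # `depth` selects which running mean/variance to update (training) and use (inference).
        return self.relu(self.bns[depth](self.conv(x)))
```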
5.6 Auto-Encoding Performance

Our training setup is comparable with that of BRCA [Zhang and LeCun, 2018]. In each epoch, we randomly select 1 million sentences from the training corpus. We trained using SGD with momentum 0.5 in batches of 32 paragraphs of random length, balance-padded to a fixed length of 1024 tokens, including a special end-of-sequence token. The training was run for 16 epochs, and the learning rate was multiplied by 0.1 every epoch after the 10th epoch. The model suffered from the exploding gradient problem [Hochreiter et al., 2001], and gradient clipping stabilized the training, enabling even higher learning rates. With clipping, we were able to set the learning rate as high as 30.0, cutting down training time to as low as 5 epochs.

Figure 4 shows the auto-encoding error rate on the test set by sentence length. Our best model achieved 1.5% test error, computed as the average byte-level error on the English Wikipedia dataset. Finally, we were able to train a static version of our model (i.e., with no shared weights in the recursion group) in comparable time, closing a huge gap in convergence between recursive and static models in vanilla BRCA.

Figure 4: Decoding errors on unseen data for our best models (N = 8, no linear layers) with balanced input padding to a sequence of size 1024, compared with the Byte-Level Recursive Convolutional Auto-Encoder (BRCA). [Bar chart; x-axis: paragraph length (65-128 B, 129-256 B, 257-512 B, 513-1024 B); y-axis: reconstruction error; series: Ours (clipping, BN, 5 epochs), Ours (BN, 16 epochs), Ours (static, BN, 16 epochs), Ours (IN, 16 epochs), BRCA.]

5.7 Generalization

We investigated which inputs influence correct predictions of the network using the method of Integrated Gradients [Sundararajan et al., 2017]. We have produced two heatmaps of input-output relationships, for short (128 bytes) and long (1024 bytes) paragraphs, in our best model (Figure 3). In theory, a model performing identity should have a diagonal heatmap. Our model finds relations within the bytes of individual words, rarely crossing word and phrase boundaries. In this sense, it fails to exploit the ordering of words. However, the order is mostly preserved in the latent vector.

Figure 3: Input-output byte relations (X axis vs. Y axis) as indicated by the method of Integrated Gradients [Sundararajan et al., 2017] with 50 integration points. The plots correspond to (a) a 659-byte Wikipedia paragraph ("One of Lem's major recurring themes, beginning from his very first novel, 'The Man from Mars' ...") and (b) a 128-byte Wikipedia paragraph ("Typical fuel is denatured alcohol, methanol, or isopropanol ..."). The leftmost plots show relations between all input-output bytes, the middle plots show the first 64 bytes, and the rightmost plots also plot spaces. Dark shades indicate strong relations; those lie along diagonals and do not cross word and phrase boundaries.

Early in the training the model learns to output only spaces, which are the most common bytes in an average Wikipedia paragraph. Later during training, it learns to correctly rewrite spaces, while filling in the words with vowels, which are the most frequent non-space characters. Interestingly, the compressing behavior seems to be language-specific and triggered only by longer sequences. Figure 5 presents input sentences and auto-encoded outputs of our best model, trained on English Wikipedia, for English, French, and random input sequences.

Figure 5: Auto-encoding capabilities of the model, with errors marked: (a) a random string of characters (under 128 bytes), (b) French and English sentences (under 256 and 128 bytes, respectively), (c) the same French and English sentences concatenated 4 times to form a longer input (only a prefix shown). The model was trained only on English Wikipedia paragraphs. On short sequences, our model performs close to an identity function. On longer ones, it seems to correctly auto-encode only English paragraphs. Note that the model tries to map French words into English ones (avec → open, une → one). We observed a similar behavior on other languages as well.
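The attribution analysis above relies on Integrated Gradients. A generic sketch of the method (a straight-line path from a baseline to the input, approximated here with 50 points, applied to input embeddings; `target_fn` would select, e.g., the log-probability of one output byte) could look as follows. It is a textbook-style illustration, not the exact analysis script used for Figure 3.

```python
import torch

def integrated_gradients(model, embeds, target_fn, baseline=None, steps=50):
    """Approximate IG attributions of a scalar output w.r.t. input embeddings.

    embeds:    (1, length, d) tensor of input embeddings
    target_fn: maps the model output to the scalar being explained
    """
    if baseline is None:
        baseline = torch.zeros_like(embeds)
    total_grad = torch.zeros_like(embeds)
    for k in range(1, steps + 1):
        # k-th point on the straight line between the baseline and the input
        point = (baseline + (k / steps) * (embeds - baseline)).requires_grad_(True)
        score = target_fn(model(point))
        grad, = torch.autograd.grad(score, point)
        total_grad += grad
    # Average path gradient times the input difference (Riemann-sum approximation).
    return (embeds - baseline) * total_grad / steps
```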
6 Word-Level Sentence Encoder

Following the methods and work of [Conneau et al., 2017], we apply our architecture to a practical task. Namely, we train models consisting of the recursive convolutional word-level encoder and a simple three-layer fully-connected classifier on the Stanford Natural Language Inference (SNLI) corpus [Bowman et al., 2015]. This dataset contains 570k sentence pairs, each one described by one of three relation labels: entailment, contradiction, and neutral. We then test the encoders on various transfer tasks measuring semantic similarities between sentences.

The encoder of each model has a similar architecture to the previously described byte-level encoder. However, instead of bytes it takes words as its input sequence. Our best encoder has N = 8 layers in each group. The recursive group is applied K times, where 2^K is the length of a padded input sequence, so that the latent vector is of the size of a word vector. We use pre-trained GloVe vectors⁴ and we do not fine-tune them. We compared both fixed-length balanced and variable-length input paddings. In fixed-length padding, up to the first 64 words are taken from each sentence. We also compare an ensemble of our best trained model and bag-of-words as a sentence representation. Let v be the output vector of the encoder, and u = (1/m) Σ_{i=1}^{m} e(w_i) be the average of the word vectors of the sentence, where m is the length of the sentence, w_i is its i-th word, and e(w) is the GloVe embedding of the word w. The final embedding is the sum x = v + u.
⁴ https://nlp.stanford.edu/projects/glove/
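A sketch of this ensemble representation, with `encoder` and `glove` as placeholders for the trained word-level encoder and the pre-trained GloVe lookup (both assumed to produce vectors of the same dimensionality):

```python
import numpy as np

def sentence_embedding(words, encoder, glove):
    """x = v + u: the encoder output plus the average of the words' GloVe vectors."""
    v = encoder(words)                                  # fixed-size vector from the conv encoder
    word_vecs = [glove[w] for w in words if w in glove]
    u = np.mean(word_vecs, axis=0) if word_vecs else np.zeros_like(v)
    return v + u
```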
Table 4 presents results for the word-level recursive convolutional encoder (WRCE), the word-level model with fixed balanced padding (Ours), and an ensemble of our model and an average embedding of the input sequence (Ours + BoW). We compare them with a baseline model (BoW, the average of GloVe vectors for the words in a sentence) on SNLI and other classification tasks, SICK-Relatedness [Marelli et al., 2014], and the STS{12-16} tasks. The SentEval⁵ tool was used for these experiments.
⁵ https://github.com/facebookresearch/SentEval

Table 4: Results for word-level sentence encoders. We compare bag-of-words (BoW), i.e., averaged word embeddings; WRCE, the encoder from Zhang and LeCun's model at word level; our word-level model with balanced padding to 64 elements (Ours); and an ensemble of our model and BoW (Ours + BoW) on various supervised (classification accuracy) and unsupervised (Pearson/Spearman correlation coefficients) tasks.

Task (dev/test acc%)       BoW           WRCE          Ours          Ours + BoW
SNLI                       67.7 / 67.5   82.0 / 81.3   83.8 / 83.1   83.2 / 82.6
CR                         79.7 / 78.0   78.0 / 77.3   78.6 / 77.0   79.1 / 78.2
MR                         77.7 / 77.0   72.9 / 72.4   73.7 / 73.1   75.3 / 74.8
MPQA                       87.4 / 87.5   85.9 / 85.6   86.0 / 85.9   87.4 / 87.6
SUBJ                       91.8 / 91.4   86.1 / 85.4   87.2 / 86.9   89.0 / 88.9
SST Bin. Class.            80.4 / 81.4   78.1 / 77.5   77.2 / 76.7   78.1 / 78.8
SST Fine-Grained Class.    45.1 / 44.4   38.3 / 40.5   40.5 / 39.3   41.9 / 41.4
TREC                       74.5 / 82.2   67.0 / 72.4   69.2 / 71.4   71.0 / 77.4
MRPC                       74.4 / 73.2   72.4 / 71.1   73.5 / 72.5   74.1 / 73.3
SICK-E                     79.8 / 78.2   82.6 / 82.8   83.6 / 81.9   83.2 / 83.0

Task (correlation)         BoW           WRCE          Ours          Ours + BoW
SICK-R                     0.80 / 0.72   0.85 / 0.78   0.87 / 0.80   0.86 / 0.80
STS12                      0.53 / 0.54   0.56 / 0.57   0.60 / 0.60   0.62 / 0.61
STS13                      0.45 / 0.47   0.55 / 0.54   0.53 / 0.54   0.57 / 0.58
STS14                      0.53 / 0.54   0.65 / 0.63   0.68 / 0.70   0.69 / 0.66
STS15                      0.56 / 0.59   0.68 / 0.69   0.70 / 0.70   0.71 / 0.72
STS16                      0.52 / 0.57   0.69 / 0.70   0.70 / 0.72   0.71 / 0.73

For certain tasks, especially those measuring textual similarity, which are useful in retrieval-based response generation in dialogue systems, the presented models perform better than bag-of-words. However, they are still not on par with LSTM-based methods [Conneau et al., 2017; Kiros et al., 2015], which generate more robust embeddings. LSTM models are autoregressive and thus require slow sequential computations. They are also larger, with the InferSent model [Conneau et al., 2017] having over 30 times more parameters than the convolutional encoders presented in this section. In addition, our architecture can share word embedding matrices with other components of a conversational system, since word embeddings are ubiquitous in different modules of NLP systems.

In order to qualitatively assess how the results for those tasks transfer to an actual dialogue system, we have compared some retrieved responses of a simple retrieval-based agent, which matches a user utterance with a single quote from Wikiquotes [Chorowski et al., 2018]. We present a comparison of our word-level sentence encoder with the bag-of-words method on the response retrieval task (Figure 6). Human utterances from the training data of the NIPS 2017 Conversational Challenge⁶ have been selected as input utterances. We match them with the closest quote from Wikiquotes, using a method similar to the one used in the Poetwannabe chatbot [Chorowski et al., 2018]. All utterances have been filtered for foul speech (for details see [Chorowski et al., 2018]), tokenized using the Moses tokenizer⁷, and embedded as vectors. For every user input utterance, we return the closest matching quote from Wikiquotes with respect to the cosine similarity.
⁶ http://convai.io/2017/data/
⁷ https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/python-tokenizer/moses.py

7 Discussion and Future Work

The problem of efficiently producing good and robust sentence embeddings seems far from being solved. In this work, which we consider a step forward in the exploration of possible tools and solutions, we analyzed and proposed improvements to the BRCA model by Xiang Zhang and Yann LeCun [2018]. With its extensive usage of convolutions, our model is efficient in terms of computation and memory usage. By analyzing BRCA we were able to diagnose problems with its training, such as exploding gradients, and understand the difficulty in auto-encoding of long paragraphs, especially in the initial stage of training. Furthermore, we showed how to successfully apply batch normalization with recursive layers, and investigated input-output relations with the Integrated Gradients method.

The recursive convolutional architecture benefits from ease of training and a low number of parameters. Due to our realization that in the current byte-level setting input-output relations rarely cross word boundaries, we demonstrate the applicability of the architecture in a word-level setting as a sentence embedder. Furthermore, good performance on semantic similarity tasks while using little resources demonstrates its practical usefulness for dialog systems.

Acknowledgments

The authors thank Paweł Rychlikowski and Michał Zapotoczny for fruitful discussions, and Xiang Zhang for help towards a better understanding of the original model. The research was partially supported by National Science Center (Poland) grant Sonata 8 2014/15/D/ST6/04402.
Figure 6: Sample answers of retrieval-based agents which embed sentences as either BoWs, or BoWs combined with our method.

User: Hello, have you studied algebra?
BoW: I never failed in mathematics. Before I was fifteen I had mastered differential and integral calculus.
Ours+BoW: I never failed in mathematics. Before I was fifteen I had mastered differential and integral calculus.

User: From what I understand it's a video game.
BoW: Life is like a piano. What you get out of it depends on how you play it.
Ours+BoW: From now on, I control all aspects of the game.

User: God
BoW: Knowledge of God is obedience to God.
Ours+BoW: God and all attributes of God are eternal.

User: Have you been to South Asia?
BoW: We do not want chaos in South Africa.
Ours+BoW: There is a wide knowledge gap between us and the developed world in the West and in Asia.

User: How do you like Hackathon?
BoW: Any thing you can do I can do better.
Ours+BoW: Discover the things that you do that make you proud as hell.

User: How do you think?
BoW: Any thing you can do I can do better.
Ours+BoW: How you think matters more than what you think.

User: I just didn't catch what was context message about.
BoW: I read it because I was asked to explain what the truth is.
Ours+BoW: I spend so much time thinking about all the answers to the problem... that I forget what the problem actually was.

User: I'm an idiot
BoW: I am an Agnostic because I am not afraid to think.
Ours+BoW: I wish I could say I was sorry.

User: It's classics!
BoW: I love musical theatre and my dream is to do Once On This Island.
Ours+BoW: No work which is destined to become a classic can look like the classics which have preceded it.

User: So, start talking.
BoW: Oh, ok, ok... Fair enough, yeah, rage it up. Rage all you want. Good things are coming. Good things.
Ours+BoW: Many people talk much, and then very many people talk very much more.

User: Technically correct
BoW: Surely only correct understanding could lead to correct action.
Ours+BoW: Where an opinion is general, it is usually correct.

User: Thats why I play computer games alone.
BoW: I have no time to play games.
Ours+BoW: The only legitimate use of a computer is to play games.

User: Well, can you dance?
BoW: If I can mince, you can dance.
Ours+BoW: Ah, so you wish to dance.

User: What about ivy league?
BoW: Ah wonder if anybody this side of the Atlantic has ever bought a baseball bat with playing baseball in mind.
Ours+BoW: This is so far out of my league.

References

[Bai et al., 2018] S. Bai, J. Zico Kolter, and V. Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. ArXiv e-prints, March 2018.

[Bojanowski et al., 2016] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. CoRR, abs/1607.04606, 2016.

[Bowman et al., 2015] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2015.

[Cho et al., 2014] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

[Chorowski et al., 2015] Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pages 577–585, 2015.
[Chorowski et al., 2018] Jan Chorowski, Adrian Lancucki, Szymon Malik, Maciej Pawlikowski, Paweł Rychlikowski, and Paweł Zykowski. A Talker Ensemble: the University of Wrocław's entry to the NIPS 2017 Conversational Intelligence Challenge. CoRR, abs/1805.08032, 2018.

[Conneau et al., 2017] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. CoRR, abs/1705.02364, 2017.

[He et al., 2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, November 1997.

[Hochreiter et al., 2001] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.

[Ioffe and Szegedy, 2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.

[Joulin et al., 2016] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. CoRR, abs/1607.01759, 2016.

[Kiros et al., 2015] Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Skip-thought vectors. CoRR, abs/1506.06726, 2015.

[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, pages 1097–1105, USA, 2012. Curran Associates Inc.

[Le and Mikolov, 2014] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents. CoRR, abs/1405.4053, 2014.

[Levitin and Reingold, 1994] Lev B. Levitin and Zeev Reingold. Entropy of natural languages: Theory and experiment. Chaos, Solitons & Fractals, 4(5):709–743, 1994.

[Liu et al., 2017] Huiting Liu, Tao Lin, Hanfei Sun, Weijian Lin, Chih-Wei Chang, Teng Zhong, and Alexander I. Rudnicky. RubyStar: A non-task-oriented mixture model dialog system. CoRR, abs/1711.02781, 2017.

[Marelli et al., 2014] Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. A SICK cure for the evaluation of compositional distributional semantic models. May 2014.

[Mikolov et al., 2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.

[Pang and Lee, 2008] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135, 2008.

[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[Pichl et al., 2018] Jan Pichl, Petr Marek, Jakub Konrád, Martin Matulík, Hoang Long Nguyen, and Jan Šedivý. Alquist: The Alexa Prize socialbot. CoRR, abs/1804.06705, 2018.

[Ram et al., 2018] Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn, Behnam Hedayatnia, Ming Cheng, Ashish Nagar, Eric King, Kate Bland, Amanda Wartick, Yi Pan, Han Song, Sk Jayadevan, Gene Hwang, and Art Pettigrue. Conversational AI: The science behind the Alexa Prize. CoRR, abs/1801.03604, 2018.

[Redmon et al., 2015] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.

[Ritter et al., 2011] Alan Ritter, Colin Cherry, and William B. Dolan. Data-driven response generation in social media. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 583–593, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.

[Serban et al., 2017] Iulian Vlad Serban, Chinnadhurai Sankar, Mathieu Germain, Saizheng Zhang, Zhouhan Lin, Sandeep Subramanian, Taesup Kim, Michael Pieper, Sarath Chandar, Nan Rosemary Ke, Sai Mudumba, Alexandre de Brébisson, Jose Sotelo, Dendi Suhubdy, Vincent Michalski, Alexandre Nguyen, Joelle Pineau, and Yoshua Bengio. A deep reinforcement learning chatbot. CoRR, abs/1709.02349, 2017.

[Sundararajan et al., 2017] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
[Ulyanov et al., 2016] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, abs/1607.08022, 2016.

[van den Oord et al., 2017a] Aäron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis. Parallel WaveNet: Fast high-fidelity speech synthesis. CoRR, abs/1711.10433, 2017.

[van den Oord et al., 2017b] Aäron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis. Parallel WaveNet: Fast high-fidelity speech synthesis. CoRR, abs/1711.10433, 2017.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.

[Weissenborn et al., 2017] Dirk Weissenborn, Georg Wiese, and Laura Seiffe. FastQA: A simple and efficient neural architecture for question answering. CoRR, abs/1703.04816, 2017.

[Yang et al., 2018] Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-yi Kong, Noah Constant, Petr Pilar, Heming Ge, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. Learning semantic textual similarity from conversations. CoRR, abs/1804.07754, 2018.

[Yusupov and Kuratov, 2017] Idris Yusupov and Yurii Kuratov. Skill-based conversational agent. December 2017.

[Zhang and LeCun, 2018] X. Zhang and Y. LeCun. Byte-level recursive convolutional auto-encoder for text. ArXiv e-prints, February 2018.