Efficient Purely Convolutional Text Encoding

Szymon Malik*, Adrian Lancucki*, Jan Chorowski
Institute of Computer Science, University of Wrocław
szymon.w.malik@gmail.com, {alan,jch}@cs.uni.wroc.pl
* Equal contribution

Abstract

In this work, we focus on a lightweight convolutional architecture that creates fixed-size vector embeddings of sentences. Such representations are useful for building NLP systems, including conversational agents. Our work derives from a recently proposed recursive convolutional architecture for auto-encoding text paragraphs at byte level. We propose alterations that significantly reduce training time and the number of parameters, and improve auto-encoding accuracy. Finally, we evaluate the representations created by our model on tasks from the SentEval benchmark suite, and show that it can serve as a better, yet fairly low-resource, alternative to popular bag-of-words embeddings.

1 Introduction

Modern conversational agents often make use of retrieval-based response generation modules [Ram et al., 2018], in which the response of the agent is retrieved from a curated database. The retrieval can be implemented as similarity matching in a vector space, in which natural language sentences are represented as fixed-size vectors. Cosine and Euclidean distances typically serve as similarity measures. Such approaches have been applied by participants of recent chatbot contests: The 2017 Alexa Prize [Pichl et al., 2018; Liu et al., 2017; Serban et al., 2017], and The 2017 NIPS Conversational Intelligence Challenge [Chorowski et al., 2018; Yusupov and Kuratov, 2017]. Retrieval-based modules are fast and predictable.
Most importantly, they enable soft matching between representations. Apart from this straightforward application in dialogue systems, sentence embeddings are applicable in downstream NLP tasks relevant to dialogue systems. Those include sentiment analysis [Pang and Lee, 2008], question answering [Weissenborn et al., 2017], censorship [Chorowski et al., 2018], or intent detection.

Due to the temporal nature of natural languages, recurrent neural networks gained popularity in NLP tasks. Active research of different architectures led to great advances and eventually a shift towards methods using the transformer architecture [Vaswani et al., 2017] or convolutional layers [Bai et al., 2018; van den Oord et al., 2017a; van den Oord et al., 2017b] among researchers and practitioners alike. In this work, we focus on a lightweight convolutional architecture that creates fixed-size representations of sentences.

Convolutional neural networks have the inherent ability to detect local structures in the data. In the context of conversational systems, their speed and memory efficiency eases deployment on mobile devices, allowing fast response retrieval and a better user experience. We analyze and build on the recently proposed Byte-Level Recursive Convolutional Auto-Encoder (BRCA) for text paragraphs [Zhang and LeCun, 2018], which is able to auto-encode text paragraphs into fixed-size vectors, reading in bytes with no additional preprocessing.

Based on our analysis, we are able to explain the behavior of this model, point out possible enhancements, and achieve auto-encoding accuracy improvements and an order of magnitude training speed-up, while cutting down the number of parameters by over 70%. We introduce a balanced padding scheme for input sequences and show that it significantly improves convergence and capacity of the model. As we find byte-level encoding unsuitable for embedding sentences, we demonstrate its applicability in processing sentences at word level. We train the encoder with supervision on the Stanford Natural Language Inference corpus [Bowman et al., 2015; Conneau et al., 2017] and investigate its performance on various transfer tasks to assess the quality of the produced embeddings.

The paper is structured as follows: in Section 2 we introduce some of the notions that appear in the paper. Section 3 discusses relevant work on sentence vector representations. Details of the architecture can be found in Section 4. Section 5 presents the analysis of the auto-encoder and motivations for our improvements. Section 6 demonstrates supervised training of the word-level sentence encoder, and evaluates it on tasks relevant to conversational systems. Section 7 concludes the paper.

2 Preliminaries

An information retrieval conversational agent selects a response from a fixed set. Let D = {(i_k, r_k)}_k be a set of conversational input-response pairs, and q be the current user's input. Two simple ways of retrieving a response r_k from available data are [Ritter et al., 2011; Chorowski et al., 2018]:
• return the r_k most similar to the user's input q,
• return the r_k for which i_k is most similar to q.
Utterances i_k, r_k, q may be represented (embedded) as real-valued vectors.
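To make these two retrieval rules concrete, the sketch below scores candidates by cosine similarity between embedded utterances. It is only an illustration of the scheme described above, not the paper's implementation; the `embed` argument stands for any sentence encoder, such as the one developed in this paper.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve(q, pairs, embed, match_on="response"):
    """Return the response r_k whose embedding (match_on='response') or whose
    input i_k's embedding (match_on='input') is closest to the embedded query q."""
    q_vec = embed(q)
    best_score, best_response = -np.inf, None
    for i_k, r_k in pairs:
        key = r_k if match_on == "response" else i_k
        score = cosine(q_vec, embed(key))
        if score > best_score:
            best_score, best_response = score, r_k
    return best_response
```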
Many NLP systems represent words as points in a continuous vector space using word embedding methods [Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2016]. They are calculated based on the co-occurrence of words in large corpora. The same methods were applied to obtain sentence embeddings only with partial success, due to the combinatorial explosion of all possible word combinations which make up a sentence. Instead, Recurrent Neural Networks (RNNs), autoregressive models that can process input sequences of arbitrary length, are thought to be a good method for handling variable-length textual data, with the Long Short-Term Memory network (LSTM) [Hochreiter and Schmidhuber, 1997] being the prime example.

Recently, RNNs have been reported to be successfully replaced by convolutional architectures [Bai et al., 2018; van den Oord et al., 2017a; van den Oord et al., 2017b]. Convolutional neural networks are traditionally associated with computer vision and image processing [Krizhevsky et al., 2012; Redmon et al., 2015]. They primarily consist of convolutional layers that apply multiple convolutions to the input, followed by pooling layers that are used for reducing the dimensionality of the hidden state. Convolutional networks are efficient during training and inference: they utilize few parameters and do not require sequential computations, making hardware parallelism easy to use. Due to their popularity in image processing, there are efficient implementations that scale well.

A residual connection [He et al., 2015] adds an unchanged input to the output of a layer or block of layers. During the forward pass it provides upper layers with an undistorted signal from the input and intermediate layers. During the backward pass it mitigates vanishing and exploding gradient problems [Hochreiter et al., 2001].

Batch Normalization (BN) [Ioffe and Szegedy, 2015] applies normalization to all activations in every minibatch. Typical operations used in neural networks are sensitive to changes in the range and magnitude of inputs. During training, the inputs to upper layers vary greatly due to changes in the weights of the model. Normalization of the signal in each layer has the potential to alleviate this problem. Both BN and residual connections enable faster convergence by helping with the forward and backward flow of information. They were also crucial in training our models.

3 Related Work

There are many methods for creating sentence embeddings, the simplest being averaging the word-embedding vectors of a given sentence [Joulin et al., 2016; Le and Mikolov, 2014]. SkipThought [Kiros et al., 2015] generalizes the idea of unsupervised learning of word2vec word embeddings [Mikolov et al., 2013]. It is implemented in the encoder-decoder setting using LSTM networks. Given a triplet of consecutive sentences (s_{i-1}, s_i, s_{i+1}), the encoder creates a fixed-size embedding vector of the sentence s_i, and the decoder tries to generate sentences s_{i-1} and s_{i+1} from this representation. SkipThought vectors have been shown to preserve syntactic and semantic properties [Kiros et al., 2015], so that their similarity is represented in the embedding space.

The InferSent model [Conneau et al., 2017] shows that training embedding systems with supervision on a natural language inference task may be superior to unsupervised training. Recently, better results were obtained by combining supervised learning on an auxiliary natural language inference corpus with learning to predict the correct response on a conversational data corpus [Yang et al., 2018].

4 Model Description

Our model builds on the Byte-Level Recursive Convolutional Auto-Encoder (BRCA) [Zhang and LeCun, 2018]. Both models use a symmetrical encoder and decoder. They encode a variable-length input sequence as a fixed-size latent representation by applying a variable number of convolve-and-pool stages (Figure 1). Unlike in autoregressive models, the recursion is not applied over the input sequence length, but over the depth of the model. As a result, each recursive step processes all sequence elements in parallel.

Figure 1: Structural comparison of the Byte-Level Recursive Convolutional Auto-Encoder (BRCA, panel a) and our model (panel b). Dark boxes indicate input padding. BRCA pads the input from the right to the nearest power of two; we pad the input evenly to a fixed-size vector. Our model does not have postfix/prefix groups with linear layers and uses Batch Normalization (BN) after every layer.

4.1 Encoder

Our encoder uses two operation groups from the BRCA: a prefix group and a recursive group. The prefix group consists of N temporal convolutional layers, and the recursive group consists of N temporal convolutional layers followed by a max-pool layer with kernel size 2.
For each sequence, the prefix group is applied only once, while the recursive group is applied multiple times, sharing weights between all applications. All convolutional layers have d = 256 channels, kernels of size 3, and are organized into residual blocks, with 2 layers per block, ReLU activations, residual connections, and Batch Normalization (see Section 5.5 for details).

The encoder of our model reads in text, by sentence or by paragraph, as a sequence of discrete tokens (e.g., bytes, characters, words). Each input token is embedded as a fixed-length d-dimensional vector. Unlike [Zhang and LeCun, 2018], where the input sequence is zero-padded to the nearest power of 2, we pad the input to a fixed length 2^K, distributing the characters evenly across the input. We motivate this decision by the finding that the model zero-padded to the nearest power of 2 does not generalize to sentences longer than those seen in the training data (see Section 5.2).

First, the prefix group is applied, which retains the dimensionality and the number of channels, i.e., sequence length and embedding size. Let d · 2^r be the dimensionality of the latent code output by the encoder. The encoder then applies the recursive group K − r times. Note that with r one may control the size of the latent vector (the level of compression). With the max-pooling, every application halves the length of the sequence. Weights are shared between applications. Finally, the encoder outputs a latent code of size d · 2^r. Unlike [Zhang and LeCun, 2018], we do not apply any linear layers after the recursions. Our experiments have shown that they slightly degrade the performance of the model, and constitute the majority of its parameters.

4.2 Decoder

The decoder acts in reverse. First, it applies a recursive group consisting of a convolutional layer which doubles the number of channels to 2d, followed by an expand transformation [Zhang and LeCun, 2018] and N − 1 convolutional layers. Then it applies a postfix group of N temporal convolutional layers. Similarly to the encoder, the layers are organized into residual blocks with ReLU activations and Batch Normalization, and have the same dimensionality and kernel size. We double the size of the input in the residual connection, which bypasses the first two convolutions of the recursive group, by stacking it with itself. We found it crucial for convergence speed to use only residual blocks in the network, also in the expand block.

The decoder applies its recursive group K − r times. Each application doubles the number of channels, while the expand transformation reorders and reshapes the data to effectively double the length of the sequence, so that the number of channels stays unchanged and equal to d. The postfix group processes a tensor of size 2^K × d and retains its dimensionality. The output is interpreted as 2^K probability distributions over possible output elements. Adding an output embedding layer, either tied with the input embedding layer or separate, slowed down training and did not improve the results. At the end, a Softmax layer is used to compute output probabilities over possible bytes. Note that the output probabilities are independent of one another conditioned on the input.
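The sketch below summarizes our reading of the encoder side of this architecture (Section 4.1) in PyTorch: a prefix group applied once, followed by a weight-shared recursive group with max-pooling applied K − r times. It is an illustration only; the exact layer counts, block layout, and other details of the released implementation may differ.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two temporal convolutions (kernel size 3) with BN, ReLU and a residual connection."""
    def __init__(self, d=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(d, d, kernel_size=3, padding=1), nn.BatchNorm1d(d), nn.ReLU(),
            nn.Conv1d(d, d, kernel_size=3, padding=1), nn.BatchNorm1d(d))
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)

class RecursiveConvEncoder(nn.Module):
    def __init__(self, vocab_size=256, d=256, n_blocks=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.prefix = nn.Sequential(*[ResBlock(d) for _ in range(n_blocks)])
        # A single recursive group; its weights are re-used at every depth.
        self.recursive = nn.Sequential(*[ResBlock(d) for _ in range(n_blocks)])
        self.pool = nn.MaxPool1d(kernel_size=2)

    def forward(self, tokens, r=2):
        # tokens: (batch, 2**K) integer ids, balance-padded to a fixed length.
        x = self.embed(tokens).transpose(1, 2)   # (batch, d, 2**K)
        x = self.prefix(x)
        while x.size(-1) > 2 ** r:               # the recursive group runs K - r times
            x = self.pool(self.recursive(x))     # each application halves the length
        return x                                 # latent code with d * 2**r values

# For example, 1024 balance-padded bytes with d = 256 and r = 2
# yield a latent code of 256 * 4 = 1024 floats:
# z = RecursiveConvEncoder()(torch.randint(0, 256, (8, 1024)))
```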
5 Model Analysis

In this section we justify our design choices through a series of experiments with the BRCA model and report their outcomes.¹
¹ Source code of our models is available at https://github.com/smalik169/recursive-convolutional-autoencoder

5.1 Data

In order to produce comparable results, we prepared an English Wikipedia dataset with a sentence length distribution similar to that of [Zhang and LeCun, 2018]. Namely, we took at random 11 million sentences from an English Wikipedia dump extracted with WikiExtractor², so that their length distribution would roughly match that of [Zhang and LeCun, 2018] (Table 1). In experiments with random data, we generate random strings of a-zA-Z0-9 ASCII characters.
² https://github.com/attardi/wikiextractor

Table 1: Lengths of paragraphs in the English Wikipedia dataset

Length        Percentage
4-63 B        35%
64-127 B      14%
128-255 B     20%
256-511 B     18%
512-1023 B    14%

5.2 Model Capacity

Natural language texts are highly compressible due to their low entropy, which results from the redundancy of the language [Levitin and Reingold, 1994]. In spite of this, the considered models struggle to auto-encode 1024-byte short paragraphs into 1024-float latent vectors, which are 4096 bytes given their sheer information content. The transition from discrete to continuous representation and the inherent inefficiency of the model are likely to account for some of this overhead. One can imagine an initialization of weights that, given the over-capacitated latent representation, would make the network perform identity for paragraphs up to 128 bytes long³.
³ When max-pooling is replaced by convolution with stride 2 and kernel size 2.

We confirmed those speculations experimentally, training models on paragraphs of random printable ASCII characters, namely random strings of a-zA-Z0-9 symbols (Table 2). The empirical capacity of our model is 128 bytes, which sheds light on the amount of overhead. The model has to be trained on paragraphs longer than 512 bytes in order to learn useful, compressing behavior given a 1024-float latent representation.

Table 2: Learning identity by training on random sequences of ASCII characters of different length. Accuracy is presented for the BRCA (N=8) model.

Training Lengths   Test Length   Accuracy
4 – 128            128           99.81%
4 – 512            128           60.79%
                   256           22.99%
                   512            9.81%
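The synthetic paragraphs used in these capacity experiments are essentially incompressible random strings; a sketch of how such data can be generated (our assumption of the sampling procedure, which the released code may implement differently) is:

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits  # a-zA-Z0-9, as used above

def random_paragraph(min_len=4, max_len=128):
    """Sample a random ASCII paragraph with length drawn uniformly from [min_len, max_len]."""
    n = random.randint(min_len, max_len)
    return "".join(random.choice(ALPHABET) for _ in range(n))

# e.g. a training set for the identity task on lengths 4-128
train_paragraphs = [random_paragraph(4, 128) for _ in range(100_000)]
```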
5.3 Generalization to Longer Sequences

Auto-encoding RNN models such as the LSTM are known to deteriorate gradually with longer sequences [Cho et al., 2014; Chorowski et al., 2015]. We trained a BRCA model (N = 2) and an LSTM encoder-decoder network with hidden size 256. Both models were trained on sentences of length up to 128 bytes and evaluated on unseen data. The LSTM model did not perfectly learn the identity function, even though it was solving an easier task of predicting the character given the correct prefix. However, the LSTM model generalized much better on longer sequences, where the performance of BRCA deteriorated rapidly (Table 3).

Table 3: Comparison of the ability of BRCA and an LSTM encoder-decoder to learn an identity function and generalize to unseen data. Values represent byte-level decoding accuracy. Note that the LSTM decoder has the advantage of always being primed with the correct prefix sequence.

Lengths (bytes)   BRCA (N=2)   LSTM-LSTM
9-16              97.06%       91.17%
17-32             97.96%       90.20%
33-64             97.45%       91.72%
65-128            83.56%       86.34%
129-256           11.66%       72.88%
257-512            8.05%       58.80%

5.4 Balanced Padding of Input Sequences

We found BRCA difficult to train. The default hyperparameters given by the authors [Zhang and LeCun, 2018] are single-sample batches, SGD with momentum 0.9, and a small learning rate of 0.01 with 100 epochs of training. In our preliminary experiments, increasing the batch size by batching paragraphs of the same length improved convergence on datasets with short sentences (mostly up to 256 bytes long), but otherwise deteriorated results on the Wikipedia dataset, where roughly 50% of paragraphs are longer than 256 bytes. We suspect that the difficulty lies in the difference of the underlying tasks: long paragraphs require compressive behavior, while short ones merely require learning the identity function. Updating network parameters towards one task hinders the performance on the other, hence the necessity for careful training.

In order to blend in both tasks, we opted for padding input sequences into fixed-length vectors. We find it sensible to fix the maximum length of the input sentence, since the model does not generalize to unseen lengths anyway. Variable length of input in BRCA does save computations; however, we found fixing the input size to greatly improve training time, despite the overhead.

In order to make the tasks more similar, we propose balanced padding of the inputs (Figure 2). Instead of padding from the right up to 2^K bytes, we pad to the nearest power of 2 and distribute the remaining padding equally in between the bytes. We hypothesized that it could free the convolutional layers from the burden of propagating the signal from left to right in order to fill the whole latent vector, as would be the case, e.g., when processing a 64-byte paragraph padded with 960 empty tokens from the right to form a 1024-byte input. Empirically, this trades additional computations for better convergence characteristics.

Figure 2: Unbalanced and balanced padding of an input sequence to a fixed-length sequence: (a) unbalanced padding to a sequence of length 8, (b) unbalanced padding to length 16, (c) balanced padding to length 8, (d) balanced padding to length 16. Grey boxes are (zero) padding, white boxes are input embedding vectors.
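A sketch of balanced padding as we read it from Figure 2: the pad symbols are spread as evenly as possible around and between the input tokens instead of being appended on the right. The exact placement of leftover pad symbols is our assumption and may differ from the released code.

```python
def balanced_pad(tokens, target_len, pad_id=0):
    """Distribute (target_len - len(tokens)) pad symbols evenly between the tokens."""
    assert len(tokens) <= target_len
    n_pad = target_len - len(tokens)
    n_gaps = len(tokens) + 1          # gaps before, between and after the tokens
    base, extra = divmod(n_pad, n_gaps)
    padded = []
    for i in range(n_gaps):
        # earlier gaps absorb the remainder, one extra pad each
        padded.extend([pad_id] * (base + (1 if i < extra else 0)))
        if i < len(tokens):
            padded.append(tokens[i])
    return padded

# balanced_pad([1, 2, 3], 8) -> [0, 0, 1, 0, 2, 0, 3, 0]
```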
5.5 Batch Normalization

Fixed-length, balance-padded inputs allow easy mixing of paragraphs of different lengths within a batch, in consequence allowing raising the batch size, applying Batch Normalization, and raising the learning rate. This enables a significant speed-up in convergence and better auto-encoding accuracy (see Section 5.6). However, the statistics collected by BN layers differ during each of the K − r recursive steps, even though the weights of the convolutions in the recursive layers are shared. This breaks auto-encoding during inference time, when BN layers have a fixed mean and standard deviation collected over a large dataset. We propose to alleviate this issue by either: a) collecting separate statistics for each recursive application and each input length separately, or b) placing a paragraph inside a batch of data drawn from the training corpus during inference and calculating the mean and the standard deviation on this batch. We also experimented with instance normalization [Ulyanov et al., 2016], which performs the normalization of the features of a single instance, rather than of a whole minibatch. We have found that instance normalization improved greatly upon the baseline model with no normalization, but performed worse than batch normalization.

BRCA has been introduced with linear layers in the postfix/prefix groups of the encoder/decoder. In our experiments, removing those layers from the vanilla BRCA lowered accuracy by a few percentage points. Conversely, our model benefits from not having linear layers. We observed faster convergence and better accuracy without them, while reducing the number of parameters from 23.4 million to 6.67 million.
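Option (a) above can be realized by pairing the shared convolution weights with a separate set of BatchNorm modules for each recursion depth. The sketch below reflects our interpretation (indexing the statistics by depth only; a second index over input lengths could be added the same way) and is not necessarily how the released code handles it.

```python
import torch.nn as nn

class SharedConvPerDepthBN(nn.Module):
    """Convolution weights shared across recursive steps, with per-depth BN running statistics."""
    def __init__(self, channels=256, max_depth=10):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)  # shared weights
        self.bns = nn.ModuleList([nn.BatchNorm1d(channels) for _ in range(max_depth)])
        self.relu = nn.ReLU()

    def forward(self, x, depth):
        # `depth` selects which running mean/variance to update (training) and use (inference).
        return self.relu(self.bns[depth](self.conv(x)))
```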
5.6 Auto-Encoding Performance

Our training setup is comparable with that of BRCA [Zhang and LeCun, 2018]. In each epoch, we randomly select 1 million sentences from the training corpus. We trained using SGD with momentum 0.5 in batches of 32 paragraphs of random length, balance-padded to a fixed length of 1024 tokens, including a special end-of-sequence token. The training was run for 16 epochs, and the learning rate was multiplied by 0.1 every epoch after the 10th epoch. The model suffered from the exploding gradient problem [Hochreiter et al., 2001], and gradient clipping stabilized the training, enabling even higher learning rates. With clipping, we were able to set the learning rate as high as 30.0, cutting down training time to as low as 5 epochs.

Figure 4 shows the auto-encoding error rate on the test set by sentence length. Our best model achieved 1.5% test error, computed as the average byte-level error on the English Wikipedia dataset. Finally, we were able to train a static version of our model (i.e., with no shared weights in the recursion group) in comparable time, closing a huge gap in convergence between recursive and static models in vanilla BRCA.

Figure 4: Decoding errors on unseen data for our best models (N = 8, no linear layers) with balanced input padding to a sequence of size 1024, compared with the Byte-Level Recursive Convolutional Auto-Encoder (BRCA). [Bar chart; x-axis: paragraph length (65-128 B, 129-256 B, 257-512 B, 513-1024 B); y-axis: reconstruction error; series: Ours (clipping, BN, 5 epochs), Ours (BN, 16 epochs), Ours (static, BN, 16 epochs), Ours (IN, 16 epochs), BRCA.]

5.7 Generalization

We investigated which inputs influence correct predictions of the network using the method of Integrated Gradients [Sundararajan et al., 2017]. We have produced two heatmaps of input-output relationships, for short (128 bytes) and long (1024 bytes) paragraphs, in our best model (Figure 3). In theory, a model performing identity should have a diagonal heatmap. Our model finds relations within the bytes of individual words, rarely crossing word and phrase boundaries. In this sense, it fails to exploit the ordering of words. However, the order is mostly preserved in the latent vector.

Figure 3: Input-output byte relations (X axis vs. Y axis) as indicated by the method of Integrated Gradients [Sundararajan et al., 2017] with 50 integration points. The plots correspond to (a) a 659-byte Wikipedia paragraph ("One of Lem's major recurring themes, beginning from his very first novel, 'The Man from Mars' ...") and (b) a 128-byte Wikipedia paragraph ("Typical fuel is denatured alcohol, methanol, or isopropanol ..."). The leftmost plots show relations between all input-output bytes, the middle plots show the first 64 bytes, and the rightmost plots also plot spaces. Dark shades indicate strong relations; those lie along diagonals and do not cross word and phrase boundaries.

Early in the training the model learns to output only spaces, which are the most common bytes in an average Wikipedia paragraph. Later during training, it learns to correctly rewrite spaces, while filling in the words with vowels, which are the most frequent non-space characters. Interestingly, the compressing behavior seems to be language-specific and triggered only by longer sequences. Figure 5 presents input sentences and auto-encoded outputs of our best model, trained on English Wikipedia, for English, French, and random input sequences.

Figure 5: Auto-encoding capabilities of the model, with errors marked: (a) a random string of characters (under 128 bytes), (b) French and English sentences (under 256 and 128 bytes, respectively), (c) the same French and English sentences concatenated 4 times to form a longer input (only a prefix shown). The model was trained only on English Wikipedia paragraphs. On short sequences, our model performs close to an identity function. On longer ones, it seems to correctly auto-encode only English paragraphs. Note that the model tries to map French words into English ones (avec → open, une → one). We observed a similar behavior on other languages as well.
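The attribution analysis above relies on Integrated Gradients. A generic sketch of the method (a straight-line path from a baseline to the input, approximated here with 50 points, applied to input embeddings; `target_fn` would select, e.g., the log-probability of one output byte) could look as follows. It is a textbook-style illustration, not the exact analysis script used for Figure 3.

```python
import torch

def integrated_gradients(model, embeds, target_fn, baseline=None, steps=50):
    """Approximate IG attributions of a scalar output w.r.t. input embeddings.

    embeds:    (1, length, d) tensor of input embeddings
    target_fn: maps the model output to the scalar being explained
    """
    if baseline is None:
        baseline = torch.zeros_like(embeds)
    total_grad = torch.zeros_like(embeds)
    for k in range(1, steps + 1):
        # k-th point on the straight line between the baseline and the input
        point = (baseline + (k / steps) * (embeds - baseline)).requires_grad_(True)
        score = target_fn(model(point))
        grad, = torch.autograd.grad(score, point)
        total_grad += grad
    # Average path gradient times the input difference (Riemann-sum approximation).
    return (embeds - baseline) * total_grad / steps
```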
6 Word-Level Sentence Encoder

Following the methods and work of [Conneau et al., 2017], we apply our architecture to a practical task. Namely, we train models consisting of the recursive convolutional word-level encoder and a simple three-layer fully-connected classifier on the Stanford Natural Language Inference (SNLI) corpus [Bowman et al., 2015]. This dataset contains 570k sentence pairs, each one described by one of three relation labels: entailment, contradiction, and neutral. We then test the encoders on various transfer tasks measuring semantic similarities between sentences.

The encoder of each model has a similar architecture to the previously described byte-level encoder. However, instead of bytes it takes words as its input sequence. Our best encoder has N = 8 layers in each group. The recursive group is applied K times, where 2^K is the length of a padded input sequence, so that the latent vector is of the size of a word vector. We use pre-trained GloVe vectors⁴ and we do not fine-tune them. We compared both fixed-length balanced and variable-length input paddings. In fixed-length padding, up to the first 64 words are taken from each sentence. We also compare an ensemble of our best trained model and bag-of-words as a sentence representation. Let v be the output vector of the encoder, and u = (1/m) Σ_{i=1}^{m} e(w_i) be the average of the word vectors of the sentence, where m is the length of the sentence, w_i is its i-th word, and e(w) is the GloVe embedding of the word w. The final embedding is the sum x = v + u.
⁴ https://nlp.stanford.edu/projects/glove/
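A sketch of this ensemble representation, with `encoder` and `glove` as placeholders for the trained word-level encoder and the pre-trained GloVe lookup (both assumed to produce vectors of the same dimensionality):

```python
import numpy as np

def sentence_embedding(words, encoder, glove):
    """x = v + u: the encoder output plus the average of the words' GloVe vectors."""
    v = encoder(words)                                  # fixed-size vector from the conv encoder
    word_vecs = [glove[w] for w in words if w in glove]
    u = np.mean(word_vecs, axis=0) if word_vecs else np.zeros_like(v)
    return v + u
```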
Table 4 presents results for the word-level recursive convolutional encoder (WRCE), the word-level model with fixed balanced padding (Ours), and an ensemble of our model and an average embedding of the input sequence (Ours + BoW). We compare them with a baseline model (BoW, the average of GloVe vectors for the words in a sentence) on SNLI and other classification tasks, SICK-Relatedness [Marelli et al., 2014], and the STS{12-16} tasks. The SentEval⁵ tool was used for these experiments.
⁵ https://github.com/facebookresearch/SentEval

Table 4: Results for word-level sentence encoders. We compare bag-of-words (BoW), i.e., averaged word embeddings; WRCE, the encoder from Zhang and LeCun's model at word level; our word-level model with balanced padding to 64 elements (Ours); and an ensemble of our model and BoW (Ours + BoW) on various supervised (classification accuracy) and unsupervised (Pearson/Spearman correlation coefficients) tasks.

Task (dev/test acc%)       BoW           WRCE          Ours          Ours + BoW
SNLI                       67.7 / 67.5   82.0 / 81.3   83.8 / 83.1   83.2 / 82.6
CR                         79.7 / 78.0   78.0 / 77.3   78.6 / 77.0   79.1 / 78.2
MR                         77.7 / 77.0   72.9 / 72.4   73.7 / 73.1   75.3 / 74.8
MPQA                       87.4 / 87.5   85.9 / 85.6   86.0 / 85.9   87.4 / 87.6
SUBJ                       91.8 / 91.4   86.1 / 85.4   87.2 / 86.9   89.0 / 88.9
SST Bin. Class.            80.4 / 81.4   78.1 / 77.5   77.2 / 76.7   78.1 / 78.8
SST Fine-Grained Class.    45.1 / 44.4   38.3 / 40.5   40.5 / 39.3   41.9 / 41.4
TREC                       74.5 / 82.2   67.0 / 72.4   69.2 / 71.4   71.0 / 77.4
MRPC                       74.4 / 73.2   72.4 / 71.1   73.5 / 72.5   74.1 / 73.3
SICK-E                     79.8 / 78.2   82.6 / 82.8   83.6 / 81.9   83.2 / 83.0

Task (correlation)         BoW           WRCE          Ours          Ours + BoW
SICK-R                     0.80 / 0.72   0.85 / 0.78   0.87 / 0.80   0.86 / 0.80
STS12                      0.53 / 0.54   0.56 / 0.57   0.60 / 0.60   0.62 / 0.61
STS13                      0.45 / 0.47   0.55 / 0.54   0.53 / 0.54   0.57 / 0.58
STS14                      0.53 / 0.54   0.65 / 0.63   0.68 / 0.70   0.69 / 0.66
STS15                      0.56 / 0.59   0.68 / 0.69   0.70 / 0.70   0.71 / 0.72
STS16                      0.52 / 0.57   0.69 / 0.70   0.70 / 0.72   0.71 / 0.73

For certain tasks, especially those measuring textual similarity, which are useful in retrieval-based response generation in dialogue systems, the presented models perform better than bag-of-words. However, they are still not on par with LSTM-based methods [Conneau et al., 2017; Kiros et al., 2015], which generate more robust embeddings. LSTM models are autoregressive and thus require slow sequential computations. They are also larger, with the InferSent model [Conneau et al., 2017] having over 30 times more parameters than the convolutional encoders presented in this section. In addition, our architecture can share word embedding matrices with other components of a conversational system, since word embeddings are ubiquitous in different modules of NLP systems.

In order to qualitatively assess how the results for those tasks transfer to an actual dialogue system, we have compared some retrieved responses of a simple retrieval-based agent, which matches a user utterance with a single quote from Wikiquotes [Chorowski et al., 2018]. We present a comparison of our word-level sentence encoder with the bag-of-words method on the response retrieval task (Figure 6). Human utterances from the training data of the NIPS 2017 Conversational Challenge⁶ have been selected as input utterances. We match them with the closest quote from Wikiquotes, using a method similar to the one used in the Poetwannabe chatbot [Chorowski et al., 2018]. All utterances have been filtered for foul speech (for details see [Chorowski et al., 2018]), tokenized using the Moses tokenizer⁷, and embedded as vectors. For every user input utterance, we return the closest matching quote from Wikiquotes with respect to the cosine similarity.
⁶ http://convai.io/2017/data/
⁷ https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/python-tokenizer/moses.py

7 Discussion and Future Work

The problem of efficiently producing good and robust sentence embeddings seems far from being solved. In this work, which we consider a step forward in the exploration of possible tools and solutions, we analyzed and proposed improvements to the BRCA model by Xiang Zhang and Yann LeCun [2018]. With its extensive usage of convolutions, our model is efficient in terms of computation and memory usage. By analyzing BRCA we were able to diagnose problems with its training, such as exploding gradients, and understand the difficulty in auto-encoding of long paragraphs, especially in the initial stage of training. Furthermore, we showed how to successfully apply batch normalization with recursive layers, and investigated input-output relations with the Integrated Gradients method.

The recursive convolutional architecture benefits from ease of training and a low number of parameters. Due to our realization that in the current byte-level setting input-output relations rarely cross word boundaries, we demonstrate the applicability of the architecture in a word-level setting as a sentence embedder. Furthermore, good performance on semantic similarity tasks while using little resources demonstrates its practical usefulness for dialog systems.

Acknowledgments

The authors thank Paweł Rychlikowski and Michał Zapotoczny for fruitful discussions, and Xiang Zhang for help towards a better understanding of the original model. The research was partially supported by National Science Center (Poland) grant Sonata 8 2014/15/D/ST6/04402.
Figure 6: Sample answers of retrieval-based agents which embed sentences as either BoWs, or BoWs combined with our method.

User: Hello, have you studied algebra?
BoW: I never failed in mathematics. Before I was fifteen I had mastered differential and integral calculus.
Ours+BoW: I never failed in mathematics. Before I was fifteen I had mastered differential and integral calculus.

User: From what I understand it's a video game.
BoW: Life is like a piano. What you get out of it depends on how you play it.
Ours+BoW: From now on, I control all aspects of the game.

User: God
BoW: Knowledge of God is obedience to God.
Ours+BoW: God and all attributes of God are eternal.

User: Have you been to South Asia?
BoW: We do not want chaos in South Africa.
Ours+BoW: There is a wide knowledge gap between us and the developed world in the West and in Asia.

User: How do you like Hackathon?
BoW: Any thing you can do I can do better.
Ours+BoW: Discover the things that you do that make you proud as hell.

User: How do you think?
BoW: Any thing you can do I can do better.
Ours+BoW: How you think matters more than what you think.

User: I just didn't catch what was context message about.
BoW: I read it because I was asked to explain what the truth is.
Ours+BoW: I spend so much time thinking about all the answers to the problem... that I forget what the problem actually was.

User: I'm an idiot
BoW: I am an Agnostic because I am not afraid to think.
Ours+BoW: I wish I could say I was sorry.

User: It's classics!
BoW: I love musical theatre and my dream is to do Once On This Island.
Ours+BoW: No work which is destined to become a classic can look like the classics which have preceded it.

User: So, start talking.
BoW: Oh, ok, ok... Fair enough, yeah, rage it up. Rage all you want. Good things are coming. Good things.
Ours+BoW: Many people talk much, and then very many people talk very much more.

User: Technically correct
BoW: Surely only correct understanding could lead to correct action.
Ours+BoW: Where an opinion is general, it is usually correct.

User: Thats why I play computer games alone.
BoW: I have no time to play games.
Ours+BoW: The only legitimate use of a computer is to play games.

User: Well, can you dance?
BoW: If I can mince, you can dance.
Ours+BoW: Ah, so you wish to dance.

User: What about ivy league?
BoW: Ah wonder if anybody this side of the Atlantic has ever bought a baseball bat with playing baseball in mind.
Ours+BoW: This is so far out of my league.

References

[Bai et al., 2018] S. Bai, J. Zico Kolter, and V. Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. ArXiv e-prints, March 2018.

[Bojanowski et al., 2016] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. CoRR, abs/1607.04606, 2016.

[Bowman et al., 2015] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2015.

[Cho et al., 2014] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

[Chorowski et al., 2015] Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pages 577–585, 2015.
[Chorowski et al., 2018] Jan Chorowski, Adrian Lancucki, Szymon Malik, Maciej Pawlikowski, Paweł Rychlikowski, and Paweł Zykowski. A Talker Ensemble: the University of Wrocław's entry to the NIPS 2017 Conversational Intelligence Challenge. CoRR, abs/1805.08032, 2018.

[Conneau et al., 2017] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. CoRR, abs/1705.02364, 2017.

[He et al., 2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, November 1997.

[Hochreiter et al., 2001] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.

[Ioffe and Szegedy, 2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.

[Joulin et al., 2016] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. CoRR, abs/1607.01759, 2016.

[Kiros et al., 2015] Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Skip-thought vectors. CoRR, abs/1506.06726, 2015.

[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, pages 1097–1105, USA, 2012. Curran Associates Inc.

[Le and Mikolov, 2014] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents. CoRR, abs/1405.4053, 2014.

[Levitin and Reingold, 1994] Lev B. Levitin and Zeev Reingold. Entropy of natural languages: Theory and experiment. Chaos, Solitons & Fractals, 4(5):709–743, 1994.

[Liu et al., 2017] Huiting Liu, Tao Lin, Hanfei Sun, Weijian Lin, Chih-Wei Chang, Teng Zhong, and Alexander I. Rudnicky. RubyStar: A non-task-oriented mixture model dialog system. CoRR, abs/1711.02781, 2017.

[Marelli et al., 2014] Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. A SICK cure for the evaluation of compositional distributional semantic models. May 2014.

[Mikolov et al., 2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.

[Pang and Lee, 2008] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135, 2008.

[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[Pichl et al., 2018] Jan Pichl, Petr Marek, Jakub Konrád, Martin Matulík, Hoang Long Nguyen, and Jan Šedivý. Alquist: The Alexa Prize socialbot. CoRR, abs/1804.06705, 2018.

[Ram et al., 2018] Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn, Behnam Hedayatnia, Ming Cheng, Ashish Nagar, Eric King, Kate Bland, Amanda Wartick, Yi Pan, Han Song, Sk Jayadevan, Gene Hwang, and Art Pettigrue. Conversational AI: The science behind the Alexa Prize. CoRR, abs/1801.03604, 2018.

[Redmon et al., 2015] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.

[Ritter et al., 2011] Alan Ritter, Colin Cherry, and William B. Dolan. Data-driven response generation in social media. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 583–593, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.

[Serban et al., 2017] Iulian Vlad Serban, Chinnadhurai Sankar, Mathieu Germain, Saizheng Zhang, Zhouhan Lin, Sandeep Subramanian, Taesup Kim, Michael Pieper, Sarath Chandar, Nan Rosemary Ke, Sai Mudumba, Alexandre de Brébisson, Jose Sotelo, Dendi Suhubdy, Vincent Michalski, Alexandre Nguyen, Joelle Pineau, and Yoshua Bengio. A deep reinforcement learning chatbot. CoRR, abs/1709.02349, 2017.

[Sundararajan et al., 2017] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
[Ulyanov et al., 2016] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, abs/1607.08022, 2016.

[van den Oord et al., 2017a] Aäron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis. Parallel WaveNet: Fast high-fidelity speech synthesis. CoRR, abs/1711.10433, 2017.

[van den Oord et al., 2017b] Aäron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis. Parallel WaveNet: Fast high-fidelity speech synthesis. CoRR, abs/1711.10433, 2017.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.

[Weissenborn et al., 2017] Dirk Weissenborn, Georg Wiese, and Laura Seiffe. FastQA: A simple and efficient neural architecture for question answering. CoRR, abs/1703.04816, 2017.

[Yang et al., 2018] Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-yi Kong, Noah Constant, Petr Pilar, Heming Ge, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. Learning semantic textual similarity from conversations. CoRR, abs/1804.07754, 2018.

[Yusupov and Kuratov, 2017] Idris Yusupov and Yurii Kuratov. Skill-based conversational agent. December 2017.

[Zhang and LeCun, 2018] X. Zhang and Y. LeCun. Byte-level recursive convolutional auto-encoder for text. ArXiv e-prints, February 2018.