<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Heterogeneous Encoders Scaling in the Transformer for Neural Machine Translation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jia Cheng Hu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Cavicchioli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Berardinelli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Capotondi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Communication and Economics (University of Modena and Reggio Emilia)</institution>
          ,
          <addr-line>viale Antonio Allegri 9, Reggio Emilia, 42121</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Physics, Informatics and Mathematics (University of Modena and Reggio Emilia)</institution>
          ,
          <addr-line>via Campi 213/A, Modena, 41125</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Although the Transformer is currently the best-performing architecture in the homogeneous configuration (self-attention only) in Neural Machine Translation, many State-of-the-Art models in Natural Language Processing are made of a combination of different Deep Learning approaches. However, these models often combine only a couple of techniques, and it is unclear why some methods are chosen over others. In this work, we investigate the effectiveness of integrating an increasing number of heterogeneous methods. Based on a simple combination strategy and performance-driven synergy criteria, we designed the Multi-Encoder Transformer, which consists of up to five diverse encoders. Results show that our approach can improve the quality of the translation across a variety of languages and dataset sizes, and that it is particularly effective in low-resource languages, where we observed a maximum increase of 7.16 BLEU compared to the single-encoder model.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine-Translation</kwd>
        <kwd>Transformer</kwd>
        <kwd>Encoder</kwd>
        <kwd>Scaling</kwd>
        <kwd>Low-resource-language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Neural Machine Translation (NMT), in Natural Language Processing (NLP), describes the
problem of automatically translating a text sentence from one language into another using
Neural Networks. Over the past years, translation systems have performed increasingly better thanks
to the design of more sophisticated and effective models. Models consist of an encoder that
extracts a representation from the input sequence in the source language and a decoder that
generates the same sentence in the target language.</p>
      <p>
        There are currently several types of neural processing methods. The most popular ones
consist of self-attentive networks (SANs or ANNs) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], recurrent neural networks [3] (RNNs)
and convolutional (CNN) ones [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">4, 5, 6</xref>
        ]. More recently, the works of [
        <xref ref-type="bibr" rid="ref6">7</xref>
        ] and [
        <xref ref-type="bibr" rid="ref7 ref8">8, 9</xref>
        ] proposed
two additional variants. These methods differ in the way they process a sequence: for instance,
recurrent neural networks [3] apply the same operation over the sequence in a sequential
manner and each element is given access to the previous ones thanks to one or two hidden
states. Self-attentive neural networks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] instead, make use of the Attention mechanism [
        <xref ref-type="bibr" rid="ref10 ref9">10, 11</xref>
        ],
a method that constructs representations made of the Softmax-weighted sum of all the elements
in the sequence. Each of these approaches presents advantages and drawbacks on its own. For
instance, RNNs better capture the sequentiality in the data compared to stateless methods [
        <xref ref-type="bibr" rid="ref11">12</xref>
        ],
such as CNNs and SANs, that rely on simple numeric encodings. However, RNNs
suffer from computational limitations and a limited receptive field.
      </p>
      <p>
        In the presence of a variety of possible processing strategies, State-of-the-Art models often
leverage the strength of a combination of multiple approaches in both NLP [
        <xref ref-type="bibr" rid="ref12 ref13 ref14 ref15 ref16">13, 14, 15, 16, 17</xref>
        ]
and Vision [
        <xref ref-type="bibr" rid="ref11 ref8">9, 12</xref>
        ]. The effectiveness of hybrid models highlights the benefits of introducing
heterogeneous ways of processing inside the same architecture. However, these
architectures often consider only two methods and, to the best of our knowledge, little has been
done on the combination of a larger number of heterogeneous encoding methods. Additionally,
different techniques in State-of-the-Art models are often closely intertwined in one single
processing block, and there is a lack of consensus on how to combine each technique
effectively. This type of integration is limiting in terms of applicability, since not
all encoding strategies can be easily combined. For instance, in a single-encoder architecture,
combining either the convolution or the self-attention with recurrent methods ultimately results
in a recurrent encoder.
      </p>
      <p>
        In this work, we investigate the effectiveness of combining an increasing
number of diverse neural network processing methods. In particular, we adopt the Transformer
decoder [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and an encoder that is made of an increasing number of heterogeneous methods.
Moreover, we adopt a very simple encoder combination strategy, namely the sum of the encoder outputs.
On the one hand, it is likely to perform worse than ad-hoc hybrid designs. On the other hand, its
simplicity enables the addition of an arbitrary number of independent and diverse encoders.
      </p>
      <p>The paper is organized as follows. Section 2 presents related works and several encoding
techniques. Then, in Section 3, we introduce the Multi-Encoder Transformer, whose encoder
combines up to five processing methods: Self-Attention, Convolutional, LSTM, Fourier
Transform (FNet), and Static Expansion. In Section 4, we present five translation datasets on which
our models are trained and tested. Finally, in Section 5, we present the results and discuss the
advantages and limitations of our Multi-Encoder Transformer. In particular, we discover that
different methods synergize differently when combined with each other and that low-resource
languages benefit the most from the increase in the number of heterogeneous encoders.</p>
      <p>In summary, our contributions are as follows: (i) we analyze the effectiveness of simply
summing multiple encoding strategies in the Transformer, in particular adopting the RNN, CNN,
SAN, Static Expansion and Fourier Transform (FNet) across a variety of language translation
tasks; (ii) we analyze the synergies of the five processing methods and design dual-,
triple-, quadruple- and quintuple-encoder models based upon the results; (iii) we show that
multi-encoding achieves better performance than the baseline Transformer,
even though each additional encoder performs worse on its own; (iv) we show
that low-resource languages are those that benefit the most from combining heterogeneous
encoders.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Many works in several research fields explored the benefits of combining multiple, sometimes
complementary, encoding strategies. QRNNs [
        <xref ref-type="bibr" rid="ref4">5</xref>
        ] tackled the computational burden of RNNs
and proposed to exploit the efficiency of the convolution operation while preserving the strong
sequential awareness using recurrent pooling layers in various NLP tasks. In Neural Machine
Translation [
        <xref ref-type="bibr" rid="ref12">13</xref>
        ] integrated the self-attention layer in the GNMT [
        <xref ref-type="bibr" rid="ref17">18</xref>
        ], a full RNN network based
on stacks of LSTM networks [
        <xref ref-type="bibr" rid="ref18">19</xref>
        ]. Conversely, [
        <xref ref-type="bibr" rid="ref13">14</xref>
        ] explored the opposite solution and integrated
the LSTM inside the Transformer. The works of [
        <xref ref-type="bibr" rid="ref19">20</xref>
        ] and [
        <xref ref-type="bibr" rid="ref14">15</xref>
        ] merged the self-attention method
and convolution on a deep level, forming a moving window in which relationships among
sequence tokens are extracted on a local level instead of a global one. On a similar level, in
Audio Speech Recognition, best-performing architectures such as the Conformer [
        <xref ref-type="bibr" rid="ref16">17</xref>
        ] and
SqueezeFormer [
        <xref ref-type="bibr" rid="ref15">16</xref>
        ] rely on a combination of convolution layers and self-attentive blocks to
capture both local and global correlations respectively in the spectrum representation. Similar
ideas were applied in other tasks such as Reading Comprehension [
        <xref ref-type="bibr" rid="ref20">21</xref>
        ] and Image Classification
[22]. The adoption of multiple encoders is natural for some problems, such as those involving
multiple input sequences [23, 24] which also led to the investigation of optimal interconnection
between the encoders and the decoder [
        <xref ref-type="bibr" rid="ref12 ref13">25, 14, 13</xref>
        ]. Also, the scalability of Transformers has
been deeply investigated in a variety of contexts [26, 27, 28, 29, 30] which mostly focus on
increasing the number of layers or sequence length.
      </p>
      <p>In contrast to these works, our work is the only one that investigates the effectiveness of
increasing the number of heterogeneous encoding methods in the Transformer and presents
hybrid architectures made of up to five strategies.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Heterogeneous Multi-Encoder</title>
      <sec id="sec-3-1">
        <title>3.1. Encoding Strategies</title>
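        <p>Given the input sequence <inline-formula><tex-math>X = \{x_1, x_2, \ldots, x_T\}</tex-math></inline-formula>, with <inline-formula><tex-math>x_t \in \mathbb{R}^{d}</tex-math></inline-formula> and <inline-formula><tex-math>d</tex-math></inline-formula> the hidden dimension, we
describe five encoding strategies based on recurrence, convolution, self-attention, Fourier
transform and static expansion.</p>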
        <p>
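          LSTM. Recurrent Neural Networks are sequence modelling methods inspired by recurrent
functions, whose output at time step <inline-formula><tex-math>t \in \{1, 2, \ldots, T\}</tex-math></inline-formula> is affected by all the inputs of the
previous steps by means of the hidden states. The hidden state consists of one or multiple
vectors depending on the architecture. In this work, we consider the LSTM [
          <xref ref-type="bibr" rid="ref18">19</xref>
          ] encoder, whose
formulation is reported below:
<disp-formula><tex-math><![CDATA[
\begin{cases}
f_t = \sigma(W_f x_{t-1} + U_f h_{t-1} + b_f) \\
i_t = \sigma(W_i x_{t-1} + U_i h_{t-1} + b_i) \\
o_t = \sigma(W_o x_{t-1} + U_o h_{t-1} + b_o) \\
\tilde{c}_t = \tanh(W_c x_{t-1} + U_c h_{t-1} + b_c) \\
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t = o_t \odot \tanh(c_t)
\end{cases}
]]></tex-math></disp-formula>
where <inline-formula><tex-math>W, U \in \mathbb{R}^{4 \cdot d \times d}</tex-math></inline-formula> and <inline-formula><tex-math>b \in \mathbb{R}^{4 \cdot d}</tex-math></inline-formula> stack the parameters of the four transformations, <inline-formula><tex-math>f_t, i_t, o_t</tex-math></inline-formula> are the forget, input and output gates, whereas
<inline-formula><tex-math>h_t, c_t</tex-math></inline-formula> represent the hidden states. We restrict ourselves to the uni-directional formulation because the
bidirectional processing of the input is already provided by the bidirectional nature of
self-attention, which supports the goal of experimenting with diverse encoding strategies.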
        </p>
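        <p>As an illustration, a single LSTM step can be sketched in plain Python. This is a minimal sketch rather than the implementation used in the experiments: the names lstm_step and affine and the per-gate parameter dictionary are hypothetical, the gates are stored separately instead of being stacked into one matrix, and the caller decides which input vector is paired with each step.</p>
        <preformat preformat-type="code">
```python
import math


def sigmoid(z: float) -> float:
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(w, u, b, x, h):
    """Return W x + U h + b for one gate (matrices are lists of rows)."""
    return [
        sum(w_k * x_k for w_k, x_k in zip(w_row, x))
        + sum(u_k * h_k for u_k, h_k in zip(u_row, h))
        + b_j
        for w_row, u_row, b_j in zip(w, u, b)
    ]


def lstm_step(params, x, h_prev, c_prev):
    """One recurrent step; params maps "f", "i", "o", "c" to (W, U, b)."""
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]    # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]    # input gate
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]    # output gate
    g = [math.tanh(z) for z in affine(*params["c"], x, h_prev)]  # candidate cell state
    c = [f_j * c_j + i_j * g_j for f_j, c_j, i_j, g_j in zip(f, c_prev, i, g)]
    h = [o_j * math.tanh(c_j) for o_j, c_j in zip(o, c)]
    return h, c


# Toy usage with hidden size 1 and all-zero parameters (hypothetical values).
zero_gate = ([[0.0]], [[0.0]], [0.0])
params = {name: zero_gate for name in ("f", "i", "o", "c")}
h, c = lstm_step(params, x=[1.0], h_prev=[0.0], c_prev=[2.0])
```
        </preformat>
        <p>With all-zero parameters every gate equals 0.5 and the candidate state is zero, so the cell state is simply halved; this makes the sequential dependency on the previous state easy to inspect.</p>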
        <p>
          ConvS2S. Convolutional networks were introduced in Natural Language Processing partly
to overcome the computational inefficiency of recurrent models [
          <xref ref-type="bibr" rid="ref4">5</xref>
          ] but later proved to be an
effective alternative to other sequence modelling architectures, even achieving
State-of-the-Art performances [
          <xref ref-type="bibr" rid="ref14 ref19 ref5">6, 15, 31, 20</xref>
          ]. In this work, we will focus on the ConvS2S encoder
[
          <xref ref-type="bibr" rid="ref5">6</xref>
          ]. Similar to the Vision case, its working principle lies in the filter banks learning the local and
simplest semantics among elements in the sequence during earlier stages and extracting
longer and more complex relationships in the higher-order layers. Given an input sequence
<inline-formula><tex-math>X</tex-math></inline-formula> and an odd kernel size <inline-formula><tex-math>k</tex-math></inline-formula>, the encoder is made of a set of filters <inline-formula><tex-math>W</tex-math></inline-formula> that perform
the convolution operation for each time step <inline-formula><tex-math>t \in \{1, 2, \ldots, T\}</tex-math></inline-formula> in the input sequence. The
output dimension is twice the model dimension in order to be fed into the GLU function, which
represents the non-linear component of the layer:
        </p>
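        <p><disp-formula><tex-math><![CDATA[\mathrm{GLU}(A, B) = A \otimes \sigma(B)]]></tex-math></disp-formula>
where <inline-formula><tex-math>A, B \in \mathbb{R}^{d}</tex-math></inline-formula> are the two halves of the output vector.</p>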
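        <p>The gating can be illustrated with a few lines of plain Python. This is a minimal sketch: the function name glu and the choice of representing the convolution output as a flat list of size twice the model dimension are assumptions made for illustration.</p>
        <preformat preformat-type="code">
```python
import math


def sigmoid(x: float) -> float:
    """Logistic function used as the gate."""
    return 1.0 / (1.0 + math.exp(-x))


def glu(y: list[float]) -> list[float]:
    """Split a vector of size 2*d into halves A and B and return A * sigmoid(B)."""
    if len(y) % 2 != 0:
        raise ValueError("GLU expects an even-sized vector")
    d = len(y) // 2
    a, b = y[:d], y[d:]
    return [a_k * sigmoid(b_k) for a_k, b_k in zip(a, b)]


# First half is the content, second half drives the gate.
out = glu([1.0, -2.0, 0.0, 0.0])
```
        </preformat>
        <p>The output has half the size of the input, which is why the convolution produces twice the model dimension.</p>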
        <p>
          Self-Attention. The attention mechanism was first introduced in NLP in [
          <xref ref-type="bibr" rid="ref9">10</xref>
          ], and it was
applied to recurrent models drastically improving the performances by allowing the encoder
information to be distributed on each encoder input rather than one single state vector. The
method was later refined in [
          <xref ref-type="bibr" rid="ref10">11</xref>
          ] and the two works provided the foundation of the Transformer
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Its main component, the self-attentive layer, consists of the following operation (considering
the single-head case):
<disp-formula><tex-math><![CDATA[\text{Self-Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V]]></tex-math></disp-formula>
where <inline-formula><tex-math>Q, K, V \in \mathbb{R}^{T \times d}</tex-math></inline-formula> are linear projections of the same input. The method has been proven
to be able to extract syntax and semantics as well as capture long-range relationships in the
sequence thanks to the global and bidirectional access to the whole input right from the first stage
of the network, in contrast to the previous methods.
        </p>
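        <p>A minimal single-head sketch of this operation in plain Python follows. It assumes identity projections, so that queries, keys and values all equal the input, and uses hypothetical helper names (matmul, transpose, softmax); a real layer would learn the three projections.</p>
        <preformat preformat-type="code">
```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(q, k, v):
    """Single-head attention: Softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Three tokens with hidden size 2; Q = K = V = X (identity projections assumed).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(x, x, x)
```
        </preformat>
        <p>Every row of weights sums to one and spans the whole sequence, which is the global and bidirectional access mentioned above.</p>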
        <p>
          Static Expansion. The Static Expansion [
          <xref ref-type="bibr" rid="ref8">9</xref>
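          ] proposes to distribute the sequence into an
arbitrary number of elements. In essence, the idea consists of a Forward phase, where the
input of length <inline-formula><tex-math>T</tex-math></inline-formula> is transformed into a new sequence featuring a new target length, and a
Backward phase, where it is transformed back to the original one. In practice, the input is linearly
projected four times, producing <inline-formula><tex-math>K, V_1, V_2, S \in \mathbb{R}^{T \times d}</tex-math></inline-formula>. Given the desired length <inline-formula><tex-math>N</tex-math></inline-formula>, <inline-formula><tex-math>2 \cdot N</tex-math></inline-formula>
expansion vectors are considered, denoted as <inline-formula><tex-math>E_Q, E_B \in \mathbb{R}^{N \times d}</tex-math></inline-formula>. The static expansion layer
performs three operations.
        </p>
        <p>(i). First, the dot product similarity is computed between the expanded queries and the keys,
obtaining the length transformation matrix <inline-formula><tex-math>L</tex-math></inline-formula>:
<disp-formula><label>(1)</label><tex-math><![CDATA[L = \frac{E_Q K^{\top}}{\sqrt{d}}]]></tex-math></disp-formula>
The result is fed into a ReLU function and normalized. The sequences featuring new lengths are
computed in the following way (Forward step):
<disp-formula><label>(2)</label><tex-math><![CDATA[R_i^{fw} = \Psi(\mathrm{ReLU}((-1)^{i+1} L), \epsilon), \quad i \in \{1, 2\}]]></tex-math></disp-formula>
<disp-formula><label>(3)</label><tex-math><![CDATA[F_i^{fw} = R_i^{fw} V_i + E_B, \quad i \in \{1, 2\}]]></tex-math></disp-formula>
where <inline-formula><tex-math>\Psi : (A, \epsilon) \rightarrow B</tex-math></inline-formula>, with <inline-formula><tex-math>A, B \in \mathbb{R}^{N_1 \times N_2}</tex-math></inline-formula>, is the normalization function and <inline-formula><tex-math>\epsilon \in \mathbb{R}_{>0}</tex-math></inline-formula> the coefficient
ensuring numeric stability.</p>
        <p>(ii). Using the same matrix of Equation 1, the sequence is transformed back to its original
length (Backward step):
<disp-formula><label>(4)</label><tex-math><![CDATA[R_i^{bw} = \Psi(\mathrm{ReLU}((-1)^{i+1} L^{\top}), \epsilon), \quad i \in \{1, 2\}]]></tex-math></disp-formula>
<disp-formula><label>(5)</label><tex-math><![CDATA[B_i^{bw} = R_i^{bw} F_i^{fw}, \quad i \in \{1, 2\}]]></tex-math></disp-formula></p>
        <p>(iii). The two results are combined through a sigmoid gate (sigmoid function indicated with
<inline-formula><tex-math>\sigma</tex-math></inline-formula>):
<disp-formula><tex-math><![CDATA[\mathrm{out} = \sigma(S) \odot B_1^{bw} + (1 - \sigma(S)) \odot B_2^{bw}]]></tex-math></disp-formula></p>
        <p>In this first version of the Static Expansion, the <inline-formula><tex-math>2 \cdot N</tex-math></inline-formula> expansion
vectors are learnable parameters of size <inline-formula><tex-math>d</tex-math></inline-formula>. Since the original formulation was developed for
Image Captioning instead of NLP tasks, we enrich the formulation by providing each expansion
vector more awareness about the sequence: the two sets of expansion vectors are fed into a bilinear projection
with the input sequence <inline-formula><tex-math>X \in \mathbb{R}^{T \times d}</tex-math></inline-formula>, scaled by a square-root factor, where <inline-formula><tex-math>E \in \mathbb{R}^{N \times d}</tex-math></inline-formula> denotes the expansion vectors and the projection weights are
learnable.</p>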
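        <p>The three operations (i) to (iii) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: it follows the first version of the Static Expansion without the bilinear enrichment, it takes the four projections of the input as ready-made arguments instead of learning them, it adds the second set of expansion vectors as a bias after the forward step, and all function names are hypothetical.</p>
        <preformat preformat-type="code">
```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def relu_normalize(m, sign, eps=1e-9):
    """ReLU of sign * m, then divide each row by its sum plus eps."""
    pos = [[max(0.0, sign * x) for x in row] for row in m]
    return [[x / (sum(row) + eps) for x in row] for row in pos]


def static_expansion(e_q, e_b, k, v1, v2, s):
    """Forward to length N, backward to length T, then sigmoid gating."""
    d = len(k[0])
    lt = [[x / math.sqrt(d) for x in row] for row in matmul(e_q, transpose(k))]  # (N, T)
    back = []
    for sign, v in ((1.0, v1), (-1.0, v2)):
        r_fw = relu_normalize(lt, sign)                        # (N, T)
        f_fw = [[a + b for a, b in zip(row, bias)]
                for row, bias in zip(matmul(r_fw, v), e_b)]    # (N, d)
        r_bw = relu_normalize(transpose(lt), sign)             # (T, N)
        back.append(matmul(r_bw, f_fw))                        # (T, d)
    gate = [[1.0 / (1.0 + math.exp(-x)) for x in row] for row in s]
    return [[g * b1 + (1.0 - g) * b2 for g, b1, b2 in zip(g_row, r1, r2)]
            for g_row, r1, r2 in zip(gate, back[0], back[1])]


# Toy sizes: T = 3 tokens, d = 2 features, N = 4 expansion vectors (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
e_q = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5], [-0.5, -0.5]]
e_b = [[0.0, 0.0] for _ in range(4)]
out = static_expansion(e_q, e_b, k=x, v1=x, v2=x, s=x)
```
        </preformat>
        <p>The output has the same shape as the input, so the layer can be stacked like any other encoder layer, while the intermediate sequences have the chosen expansion length.</p>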
        <p>
          FNet layer. The Fourier Transform describes a mathematical operation that converts a
function defined in the time domain into its spectral representation. In particular, it extracts the
coefficients of its sinusoidal components localized in the frequency domain. The Discrete Fourier
Transform consists of the discrete formulation which is designed to address discontinuous and
sampled signals. In [
          <xref ref-type="bibr" rid="ref6">7</xref>
          ] they adopt such a method and replace the self-attention in BERT,
achieving remarkably close performances despite a parameter-less formulation. The discrete
Fourier transform is defined as:
        </p>
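        <p><disp-formula><tex-math><![CDATA[X_{k+1} = \sum_{n=0}^{T-1} x_{n+1} \, e^{-\frac{2 \pi i}{T} n k}, \quad 0 \le k \le T - 1]]></tex-math></disp-formula>
which is denoted in vector form as <inline-formula><tex-math>\mathcal{F}(X) \in \mathbb{C}^{T \times d}</tex-math></inline-formula>. The FNet layer is made of two DFTs applied
over the sequence dimension first and then over the hidden dimension; finally, only the real
part of the result is preserved:</p>
        <p><disp-formula><tex-math><![CDATA[\mathcal{F}_{\mathrm{FNet}}(X) = \Re(\mathcal{F}_{\mathrm{hid}}(\mathcal{F}_{\mathrm{seq}}(X)))]]></tex-math></disp-formula></p>
        <p>The mixing of tokens along the two dimensions efficiently gives the higher-order units,
such as non-linear projections, enough access to the sequence to form meaningful
compositions of its elements.</p>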
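        <p>The token mixing can be reproduced with a naive DFT in plain Python. This is a minimal sketch for small inputs: the helper names dft and fnet_mixing are hypothetical, and a practical implementation would rely on a fast Fourier transform instead of the quadratic-time sum used here.</p>
        <preformat preformat-type="code">
```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a list of numbers."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def fnet_mixing(x):
    """DFT along the sequence axis, then along the hidden axis; keep the real part."""
    cols = [dft(list(col)) for col in zip(*x)]     # one DFT per hidden channel
    seq_mixed = [list(row) for row in zip(*cols)]  # back to (length, hidden) layout
    both = [dft(row) for row in seq_mixed]         # one DFT per token
    return [[z.real for z in row] for row in both]


# Two tokens with hidden size 2.
x = [[1.0, 2.0], [3.0, 4.0]]
y = fnet_mixing(x)
```
        </preformat>
        <p>The first output element is the sum of all inputs, which shows that every output position depends on every token and every hidden unit without any learnable parameter.</p>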
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Multi-Encoder Transformer</title>
        <p>The Multi-Encoder Transformer is made of the Transformer itself and multiple encoding blocks
joined together by means of a simple sum, as depicted in Figure 1. Each encoder block
consists of a stack of convolution filters, self-attentive layers, LSTMs, FNet
layers, or Static Expansion layers. All of them include skip-connections and normalization layers, with
the exception of the LSTM for the former and ConvS2S for the latter.</p>
        <p>The computational burden of the additional encoders is mitigated by the absence, in almost all
cases, of the Feed-Forward layer, which represents one of the computational bottlenecks of the
Transformer model. The only two exceptions are the baseline Transformer encoder, where the
component is kept to maintain consistency with the original formulation, and the FNet layers, in
which the Feed-Forward layer is necessary because the module is otherwise parameter-less.</p>
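        <p>The combination strategy itself fits in a few lines. The sketch below is an illustration under assumptions: encoders are modelled as plain functions that map a sequence to a sequence of the same shape, and the two stand-in encoders (identity and prefix_mean) are hypothetical placeholders, not the encoders used in the experiments.</p>
        <preformat preformat-type="code">
```python
from typing import Callable, Sequence

Matrix = list[list[float]]
Encoder = Callable[[Matrix], Matrix]


def sum_encoders(x: Matrix, encoders: Sequence[Encoder]) -> Matrix:
    """Run every encoder on the same input and add their outputs element-wise."""
    outputs = [encode(x) for encode in encoders]
    shape = (len(x), len(x[0]))
    for out in outputs:
        if (len(out), len(out[0])) != shape:
            raise ValueError("all encoders must preserve the (length, hidden) shape")
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*outputs)]


def identity(x: Matrix) -> Matrix:
    """Stand-in encoder that returns a copy of its input."""
    return [row[:] for row in x]


def prefix_mean(x: Matrix) -> Matrix:
    """Stand-in uni-directional encoder: mean of all tokens seen so far."""
    out, acc = [], [0.0] * len(x[0])
    for t, row in enumerate(x, start=1):
        acc = [a + v for a, v in zip(acc, row)]
        out.append([a / t for a in acc])
    return out


x = [[2.0, 0.0], [4.0, 2.0]]
memory = sum_encoders(x, [identity, prefix_mean])  # what the decoder would attend to
```
        </preformat>
        <p>Because the sum only requires matching output shapes, a further encoder can be appended to the list without touching the decoder.</p>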
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental setup</title>
      <sec id="sec-4-1">
        <title>4.1. Datasets</title>
        <p>We used five datasets in our experiments: the IWSLT 2015 English-Vietnamese (En-Vi) corpus,
the IWSLT 2017 English-Italian (En-It) corpus, two TED talks corpora in Galician-English (Gl-En) and
Slovak-English (Sk-En), and a Spanish-English (Sp-En) corpus provided by the European Language
Resource Coordination (ELRC). Low-resource languages commonly refer to less studied,
resource-scarce, and less commonly taught languages, among other definitions [32]. In our work,
we denote low-resource languages as those that involve fewer than 30,000 training pairs. Although
Spanish is a high-resource language, the size of the “Sp-En” dataset simulates a low-resource setup.</p>
        <p>Sequences whose post-tokenization length is greater than a certain threshold are discarded,
which results in a negligibly smaller training set but a notably smaller peak GPU memory
footprint. In all training instances, the target and the source language vocabularies are shared.
Each vocabulary is created using the BPE algorithm [33]. The maximum number of epochs
is chosen based on the number of training steps required for the validation accuracy to start
decreasing. For instance, the En-It training requires significantly fewer epochs than En-Vi, most
likely because of a higher similarity between source and target languages. More details are
reported in Table 1. Analytic experiments are performed mostly on the “En-Vi” pair, whose
size is close to the average size of the adopted datasets and which is preferred over “Sk-En” for better
generalization.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Models</title>
        <p>The baseline model, denoted as “Base”, consists of the Transformer with 6 layers, <inline-formula><tex-math>d_{model} = 512</tex-math></inline-formula>,
8 attention heads and <inline-formula><tex-math>d_{ff} = 2048</tex-math></inline-formula>. When an encoder is indicated as “Base”, we refer to the
“Self-Attention + FeedForward” case. For the Multi-Encoder Transformer, the number of additional
encoders and the number of layers of each additional block depend on the experiment
configuration; the latter is selected among the values {3, 6, 12, 18}. The Static Expansion
layer uses the following 12 choices of static expansion coefficients: {6, 6, 12, 8, 12, 8, 6, 6, 12, 8,
12, 8}. These specific coefficients are chosen arbitrarily, and searching for the optimal configuration
is out of the scope of this work. All kernels in the ConvS2S layers are of
width 3.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Training and Evaluation Details</title>
        <p>
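          Words are split according to the Byte Pair Encoding (BPE) [33], and no lowercasing is
applied. The Adam optimizer (<inline-formula><tex-math>\beta_1 = 0.9</tex-math></inline-formula>, <inline-formula><tex-math>\beta_2 = 0.98</tex-math></inline-formula>, <inline-formula><tex-math>\epsilon = 10^{-9}</tex-math></inline-formula>) and Label Smoothing regularization
(smoothing coefficient 0.1) are adopted. We use the Noam learning rate scheduler as described in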
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and 4,000 warm-up steps. Training pairs are sampled without replacement using a target
length-oriented bucketing until the mini-batch reaches 4,096 words. Finally, models are trained
up to the number of epochs reported in Table 1.
        </p>
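        <p>For reference, the Noam schedule increases the learning rate linearly during the warm-up steps and then decays it with the inverse square root of the step number. The sketch below uses the model dimension and warm-up length stated above; the function name noam_lr is hypothetical, and any additional constant scaling factor is assumed to be 1.</p>
        <preformat preformat-type="code">
```python
def noam_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """Learning rate at a given (1-based) optimisation step."""
    if 1 > step:
        raise ValueError("step must be a positive integer")
    # Linear warm-up branch: step * warmup^-1.5; decay branch: step^-0.5.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# The two branches meet at the end of the warm-up, where the rate peaks.
peak = noam_lr(4000)
```
        </preformat>
        <p>Halfway through the warm-up the rate is half of the peak, and after the warm-up it decreases monotonically.</p>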
        <p>We evaluate the performances using the BLEU metric [34], in particular its SacreBLEU implementation [35]
(signature: BLEU+case.mixed+lang.[src-lang]-[dst-lang]+numrefs.1+smooth.exp+tok.13a+version.2.0.0). The
inference algorithm consists of Beam Search with beam size 4. In contrast to the standard
practice, no checkpoint averaging is done, since the selection criteria are often time-based and
lead to different results depending on the computational resources.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Single Encoder Performances</title>
        <p>We evaluate the performance of each single-encoder strategy by replacing the encoder block in
the Transformer (with six encoding and six decoding layers) with each of the five encoding methods
in the “En-Vi” translation task. This dataset is used for the analytic experiments throughout
the rest of this work.</p>
        <p>The performance of each single encoder is reported in Table 2. Since Transformers are sensitive
to hyperparameters and implementation details [36, 37], we show that our baseline achieves
a performance very similar to the ones typically observed in the literature. In the
particular case of [38], it presents a negligible difference of 0.02 BLEU, paving the way for a
fair comparison. In the single-encoder case, the baseline model outperforms all the other ones;
hence, the Self-Attention method performs better than the others in the single-encoder configuration.
Table 3 reports the baseline performances across all translation tasks.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Dual-encoder Transformer and Synergy Study</title>
        <p>In Table 4 we show how the Dual-Encoder Transformer performs on the “En-Vi” translation
task when the baseline, whose encoder is made of Self-Attention layers, is combined with
other encoding techniques. In some cases, the performance seems to increase with the number of
additional layers <inline-formula><tex-math>M</tex-math></inline-formula>. This is particularly evident in the case of the LSTM, whereas
the impact of increasing the number of layers is less significant in other cases. The number of
layers for the models presented throughout this work is based on the results in Table 4.</p>
        <p>We observe that combining two encoders does not guarantee that the overall system performs at
least as well as the single-encoder case. All dual-encoders yielded a worse score compared to
the baseline, with the exception of “Base + LSTM<sub>M=18</sub>” and “Base + Static Exp<sub>M=12</sub>”, which
instead scored better by a margin of 0.09 and 0.60 BLEU respectively. This means that such improvements
are caused by neither the architectural structure nor the increase of parameters alone, since
other instances performed similarly if not worse. This claim is further supported by the poor
performances reported in Table 5, in which encoders are duplicated and summed together.</p>
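        <p>Table 4 also shows that, despite the simplicity of the combination strategy, it is possible to
design a Dual-Encoder Transformer that performs better than its respective single-encoder
counterparts. To further investigate this aspect, given the set of encoding methods <inline-formula><tex-math>E = \{B, L, C, S, F\}</tex-math></inline-formula>
representing self-attention, LSTM, ConvS2S, Static Expansion and FNet respectively, we measure
the synergy <inline-formula><tex-math>s_{ij} \in \mathbb{R}</tex-math></inline-formula> between methods <inline-formula><tex-math>e_i</tex-math></inline-formula> and <inline-formula><tex-math>e_j</tex-math></inline-formula> simply as the performance difference between
the dual-encoder <inline-formula><tex-math>e_i + e_j</tex-math></inline-formula> and the single-encoder made of the method <inline-formula><tex-math>e_i</tex-math></inline-formula>. The Synergy matrix
<inline-formula><tex-math>S</tex-math></inline-formula> is reported in Table 6.</p>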
        <p>
          Static Expansion and the LSTM combined well with the Transformer encoder; the latter case
was expected thanks to the works of [
          <xref ref-type="bibr" rid="ref12 ref13">13, 14</xref>
          ]. On the contrary, the ConvS2S case contradicts
the effectiveness observed in the literature [
          <xref ref-type="bibr" rid="ref14 ref15 ref16 ref19">20, 15, 17, 16</xref>
          ] which highlights the limitations of our
trivial combination strategy. The pair “Base + FNet" seems to perform the worst.
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Scaling Encoders</title>
        <p>Based on the Synergy matrix reported in Table 6, we increase the number of encoders in the
Multi-Encoder Transformer and discuss their performances across all datasets. To decide which
methods are used in the dual- and triple-encoder configurations, we select the top three values
of <inline-formula><tex-math>\max_{i,j \in \{1, \ldots, 5\}} \{ s_{ij} + s_{ji} \}</tex-math></inline-formula> computed on the Synergy matrix. In
particular, the top three values are 2.93, 2.10 and 1.38, corresponding to “Base +
Static Exp<sub>M=12</sub>”, “Base + LSTM<sub>M=18</sub>” and “LSTM<sub>M=18</sub> + ConvS2S<sub>M=6</sub>”. For this reason, we design
the Dual-Encoder as “Base + Static Exp”, the Triple-Encoder as “Base + Static Exp +
LSTM”, the Quadruple-Encoder as “Base + Static Exp + LSTM + ConvS2S” and the
Quintuple-Encoder with all encoding methods. The number of layers for each additional
encoder is configured according to the size of the training set; in particular, for “Gl-En” and
“Sp-En” we deploy a smaller version of the Multi-Encoder Transformer.</p>
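        <p>The selection rule can be illustrated as follows. The sketch assumes that the synergy of a pair is the score of the dual-encoder minus the score of the first method used alone, and it ranks unordered pairs by the sum of the two directed synergies. All scores below are hypothetical placeholders chosen for the example; they are not the values of Table 6.</p>
        <preformat preformat-type="code">
```python
from itertools import combinations

# Hypothetical BLEU scores (NOT taken from the paper's tables).
single = {"B": 30.0, "L": 28.0, "S": 27.5}
dual = {
    frozenset("BL"): 30.1,
    frozenset("BS"): 30.6,
    frozenset("LS"): 28.4,
}


def synergy(i: str, j: str) -> float:
    """Gain of the dual-encoder i + j over the single encoder i (assumed definition)."""
    return dual[frozenset((i, j))] - single[i]


def pair_score(pair: tuple[str, str]) -> float:
    """Symmetric criterion: s_ij + s_ji."""
    i, j = pair
    return synergy(i, j) + synergy(j, i)


# Unordered pairs sorted from the most to the least synergistic.
ranked = sorted(combinations(single, 2), key=pair_score, reverse=True)
```
        </preformat>
        <p>The first entries of ranked then indicate which methods to add, in order, when growing the encoder from two to three or more blocks.</p>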
        <p>Several observations can be made from the results reported in Table 7. (i). First of all, the
Dual-Encoder outperforms the baseline across all datasets of different sizes and languages,
whereas increasing the number of encoders beyond two leads to mixed results. In particular,
the performance increase is most noticeable in the case of two encoders; beyond this case,
the improvement is smaller, and performance sometimes degrades. This may explain why State-of-the-Art models in
the literature typically combine two deep learning methods at most. (ii). The Multi-Encoder
Transformer achieved a maximum increase of 5.35 and 7.16 BLEU in the case of “Gl-En” and
“Sp-En” respectively, suggesting that richer encoder representations can compensate for the
lack of data. Therefore, adopting multiple heterogeneous encoders can be a suitable strategy for
low-resource languages. (iii). In the case of bigger datasets, the Multi-Encoder Transformer
can be beneficial even though there is no clear relationship between data size and the increase
in performance. For example, “En-It” and “Sk-En” tasks benefit more from the Dual-Encoder
configuration compared to “En-Vi” despite having a greater and a smaller number of training
samples respectively. (iv). The Quintuple-Encoder appears to be the breaking point
of the benefits of our approach; however, this may be caused by the Fourier Transform method
not combining well with the others rather than by a limitation of the scaling method.</p>
        <p>In Table 8 we compare the best results against MarianMT [39], a Transformer model
pretrained on 1M samples from OPUS-100 [40] and fine-tuned on our datasets for 50 epochs.
Our model performs worse in low-resource setups, as expected given the lack of data, but
comparably or better otherwise. This suggests that the MarianMT architecture could be improved
with our approach to achieve even better results. Overall, increasing the number of deep
learning methods in the encoder can be beneficial, but results may depend on the application
and data size. Deep learning practitioners should weigh the trade-off between performance and
computational cost if they decide to follow this strategy.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Limitations</title>
        <p>Although we removed, wherever possible, the feed-forward layer, the additional encoders
inevitably introduce a significant additional computational cost and memory footprint as can
be seen from Table 9. As a result, the Multi-Encoder Transformer becomes increasingly less efficient
than the single-encoder baseline as the number of additional encoders grows. Fortunately,
benefits can be observed already in the dual-encoder configuration, where improvements are
obtained with less than double the training time spent for the baseline.</p>
        <p>
          The performances reported in this work do not necessarily represent the full capabilities
of each model. We focused only on one configuration of the entire hyperparameter space. In
particular, we adopted the learning rate scheduler, optimizer, and initialization method of [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]
and the same batch size is used for most configurations. While such choices are ideal for the
Transformer, they may not be suitable for all the architectures introduced in this work. The
exploration of a variety of configurations is often recommended in order to fully appreciate
new models. Such fine-tuning practice is beyond the scope of this work.
        </p>
        <p>Table 9: Number of parameters and number of GFLOPs required by the
Multi-Encoder Transformer for the forward step of a sequence of 128 tokens in both encoder and decoder,
reported per model for the tasks Gl-En, Sp-En and Sk-En, En-Vi, En-It.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>In this work, we conducted an empirical study of the effectiveness of implementing an encoder
with an increasing number of heterogeneous Deep Learning methods. We considered five Deep
Learning methods: self-attention, LSTM, convolution, static expansion, and Fourier transform.
We found that the Multi-Encoder Transformer can be an effective architectural design
for two encoders. However, increasing the number of encoders beyond two has a limited
impact, which can be positive, as in the case of low-resource languages, or harmful in
other instances.</p>
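      <p>The combination strategy studied in this work is a plain sum of the encoder outputs. The
sketch below illustrates only that step, assuming that every encoder returns a representation
with the same shape as its input. The two encoders are toy stand-ins, not the self-attention,
LSTM, convolution, static expansion, or Fourier transform encoders evaluated in this work, and
all names are hypothetical.</p>
      <preformat>
```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]            # one row per token: (sequence_length, d_model)
Encoder = Callable[[Matrix], Matrix]  # maps a sequence to a same-shaped sequence


def identity_encoder(x: Matrix) -> Matrix:
    """Toy stand-in: returns a copy of the input."""
    return [row[:] for row in x]


def mean_mixing_encoder(x: Matrix) -> Matrix:
    """Toy stand-in: every position receives the mean token vector."""
    length, width = len(x), len(x[0])
    mean = [sum(row[j] for row in x) / length for j in range(width)]
    return [mean[:] for _ in range(length)]


def sum_encoder_outputs(x: Matrix, encoders: Sequence[Encoder]) -> Matrix:
    """Run every encoder on the same input and add the outputs element-wise."""
    if not encoders:
        raise ValueError("at least one encoder is required")
    outputs = [encoder(x) for encoder in encoders]
    length, width = len(outputs[0]), len(outputs[0][0])
    for out in outputs:
        if len(out) != length or any(len(row) != width for row in out):
            raise ValueError("all encoders must return the same shape")
    return [
        [sum(out[i][j] for out in outputs) for j in range(width)]
        for i in range(length)
    ]


if __name__ == "__main__":
    source = [[1.0, 2.0], [3.0, 4.0]]
    print(sum_encoder_outputs(source, [identity_encoder, mean_mixing_encoder]))
```
      </preformat>
      <p>Because the sum gives every encoder the same weight regardless of how many are present,
adding an encoder changes the scale and the mixture of the representation passed to the decoder.</p>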
      <p>Despite combining different encoders with a simple sum, the Multi-Encoder Transformer,
particularly the Dual-Encoder configuration, outperformed the baseline across all benchmarks.
However, connecting more than two encoders had a mixed impact. We suspect this might
be due to the extreme simplicity of our combination method. Future work can investigate
better ways of combining the encoder representations, or even formulate an interconnection module
specifically designed to leverage the diversity of different Deep Learning methods.
      </p>
      <p>[3] N. Kalchbrenner, P. Blunsom, Recurrent continuous translation models, in: Proceedings
of the 2013 conference on empirical methods in natural language processing, 2013, pp.
[22] Z. Dai, H. Liu, Q. V. Le, M. Tan, Coatnet: Marrying convolution and attention for all data
sizes, Advances in Neural Information Processing Systems 34 (2021) 3965–3977.
[23] J. Shin, J.-H. Lee, Multi-encoder transformer network for automatic post-editing, in:
Proceedings of the Third Conference on Machine Translation: Shared Task Papers, 2018,
pp. 840–845.
[24] S. Humeau, K. Shuster, M.-A. Lachaux, J. Weston, Poly-encoders: Transformer architectures
and pre-training strategies for fast and accurate multi-sentence scoring, arXiv preprint
arXiv:1905.01969 (2019).
[25] J. Libovický, J. Helcl, D. Mareček, Input combination strategies for multi-source transformer
decoder, arXiv preprint arXiv:1811.04716 (2018).
[26] H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang, F. Wei, Deepnet: Scaling transformers to
1,000 layers, arXiv preprint arXiv:2203.00555 (2022).
[27] S. Jaszczur, A. Chowdhery, A. Mohiuddin, L. Kaiser, W. Gajewski, H. Michalewski, J. Kanerva,
Sparse is enough in scaling transformers, Advances in Neural Information Processing
Systems 34 (2021) 9895–9907.
[28] X. Liu, K. Duh, L. Liu, J. Gao, Very deep transformers for neural machine translation, arXiv
preprint arXiv:2008.07772 (2020).
[29] N.-Q. Pham, T.-S. Nguyen, J. Niehues, M. Müller, S. Stüker, A. Waibel, Very deep
self-attention networks for end-to-end speech recognition, arXiv preprint arXiv:1904.13377
(2019).
[30] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv
preprint arXiv:2004.05150 (2020).
[31] Z. Liu, S. Luo, W. Li, J. Lu, Y. Wu, S. Sun, C. Li, L. Yang, Convtransformer: A convolutional
transformer network for video frame synthesis, arXiv preprint arXiv:2011.10185 (2020).
[32] A. Magueresse, V. Carles, E. Heetderks, Low-resource languages: A review of past work
and future challenges, arXiv preprint arXiv:2006.07264 (2020).
[33] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword
units, arXiv preprint arXiv:1508.07909 (2015).
[34] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of
machine translation, in: Proceedings of the 40th annual meeting of the Association for
Computational Linguistics, 2002, pp. 311–318.
[35] M. Post, A call for clarity in reporting bleu scores, arXiv preprint arXiv:1804.08771 (2018).
[36] T. Q. Nguyen, J. Salazar, Transformers without tears: Improving the normalization of
self-attention, arXiv preprint arXiv:1910.05895 (2019).
[37] M. Popel, O. Bojar, Training tips for the transformer model, The Prague Bulletin of
Mathematical Linguistics 110 (2018). doi:10.2478/pralin-2018-0002.
[38] I. Provilkov, D. Emelianenko, E. Voita, Bpe-dropout: Simple and effective subword
regularization, arXiv preprint arXiv:1910.13267 (2019).
[39] M. Junczys-Dowmunt, R. Grundkiewicz, T. Dwojak, H. Hoang, K. Heafield, T. Neckermann,
F. Seide, U. Germann, A. F. Aji, N. Bogoychev, et al., Marian: Fast neural machine translation
in C++, arXiv preprint arXiv:1804.00344 (2018).
[40] J. Tiedemann, Opus-parallel corpora for everyone, Baltic Journal of Modern Computing 4
(2016).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bassignana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Brunato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramponi</surname>
          </string-name>
          ,
          <article-title>Preface to the Seventh Workshop on Natural Language for Artificial Intelligence (NL4AI)</article-title>
          ,
          <source>in: Proceedings of the Seventh Workshop on Natural Language for Artificial Intelligence (NL4AI 2023) co-located with 22nd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2023)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <article-title>Character-level convolutional networks for text classification</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>649</fpage>
          -
          <lpage>657</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Merity</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <article-title>Quasi-recurrent neural networks</article-title>
          ,
          <source>arXiv preprint arXiv:1611.01576</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gehring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grangier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yarats</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. N.</given-names>
            <surname>Dauphin</surname>
          </string-name>
          ,
          <article-title>Convolutional sequence to sequence learning</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1243</fpage>
          -
          <lpage>1252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee-Thorp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ainslie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Eckstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ontanon</surname>
          </string-name>
          ,
          <article-title>FNet: Mixing tokens with Fourier transforms</article-title>
          ,
          <source>arXiv preprint arXiv:2105.03824</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cavicchioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Capotondi</surname>
          </string-name>
          ,
          <article-title>Exploring the sequence length bottleneck in the transformer for image captioning</article-title>
          ,
          <source>arXiv preprint arXiv:2207.03327</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cavicchioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Capotondi</surname>
          </string-name>
          ,
          <article-title>ExpansionNet v2: Block static expansion in fast end to end training for image captioning</article-title>
          ,
          <source>arXiv preprint arXiv:2208.06551</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          ,
          <source>arXiv preprint arXiv:1409.0473</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Luong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Effective approaches to attention-based neural machine translation</article-title>
          ,
          <source>in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Lisbon, Portugal,
          <year>2015</year>
          , pp.
          <fpage>1412</fpage>
          -
          <lpage>1421</lpage>
          . URL: https://aclanthology.org/D15-1166. doi:10.18653/v1/D15-1166.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>Attention on attention for image captioning</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Computer Vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4634</fpage>
          -
          <lpage>4643</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M. X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Firat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bapna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Macherey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Foster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>The best of both worlds: Combining recent advances in neural machine translation</article-title>
          ,
          <source>arXiv preprint arXiv:1804.09849</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <article-title>Modeling recurrence for transformer</article-title>
          ,
          <source>arXiv preprint arXiv:1904.03092</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Chao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <article-title>Convolutional self-attention networks</article-title>
          ,
          <source>arXiv preprint arXiv:1904.03107</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gholami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mangalam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. W.</given-names>
            <surname>Mahoney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Keutzer</surname>
          </string-name>
          ,
          <article-title>Squeezeformer: An efficient transformer for automatic speech recognition</article-title>
          ,
          <source>arXiv preprint arXiv:2206.00888</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gulati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-C.</given-names>
            <surname>Chiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          , et al.,
          <article-title>Conformer: Convolution-augmented transformer for speech recognition</article-title>
          ,
          <source>arXiv preprint arXiv:2005.08100</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Macherey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krikun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Macherey</surname>
          </string-name>
          , et al.,
          <article-title>Google's neural machine translation system: Bridging the gap between human and machine translation</article-title>
          ,
          <source>arXiv preprint arXiv:1609.08144</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>Long short-term memory</article-title>
          ,
          <source>Neural computation 9</source>
          (
          <year>1997</year>
          )
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. N.</given-names>
            <surname>Dauphin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <article-title>Pay less attention with lightweight and dynamic convolutions</article-title>
          ,
          <source>arXiv preprint arXiv:1901.10430</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-T.</given-names>
            <surname>Luong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Qanet: Combining local convolution with global self-attention for reading comprehension</article-title>
          ,
          <source>arXiv preprint arXiv:1804.09541</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>