=Paper=
{{Paper
|id=Vol-3074/paper17
|storemode=property
|title=Fuzzy sets complement-based Gated Recurrent Unit
|pdfUrl=https://ceur-ws.org/Vol-3074/paper17.pdf
|volume=Vol-3074
|authors=Mikel Ferrero-Jaurrieta, Graçaliz Pereira Dimuro, Zdenko Takáč, Regivan H. N. Santiago, Javier Fernández, Humberto Bustince
|dblpUrl=https://dblp.org/rec/conf/wilf/Ferrero-Jaurrieta21
}}
==Fuzzy sets complement-based Gated Recurrent Unit==
Mikel Ferrero-Jaurrieta1, Graçaliz Pereira Dimuro1,2, Zdenko Takáč3, Regivan H. N. Santiago4, Javier Fernández1 and Humberto Bustince1
1 Department of Statistics, Computer Science and Mathematics, Public University of Navarre, Campus Arrosadía s/n, 31006 Pamplona, Spain
2 Centro de Ciências Computacionais, Universidade Federal do Rio Grande, Rio Grande, 96044540, Brazil
3 Institute of Information Engineering, Automation and Mathematics, Faculty of Chemical and Food Technology, Slovak University of Technology in Bratislava, Radlinského 9, 812 37 Bratislava, Slovakia
4 Department of Computer Science and Applied Mathematics, Universidade Federal do Rio Grande do Norte, Natal, 1524, Brazil
WILF 2021: 13th International Workshop on Fuzzy Logic and Applications, Dec. 20–22, 2021, Vietri sul Mare, Italy
mikel.ferrero@unavarra.es (M. Ferrero-Jaurrieta); gracaliz.pereira@unavarra.es (G. P. Dimuro); zdenko.takac@stuba.sk (Z. Takáč); regivan@dimap.ufrn.br (R. H. N. Santiago); fcojavier.fernandez@unavarra.es (J. Fernández); bustince@unavarra.es (H. Bustince)
Abstract
Gated Recurrent Units (GRU) are gated neural network architectures that simplify other ones (such as the LSTM), mainly by coupling gates. Instead of using two gates, if 𝑥 is the first gate, the standard operation 1 − 𝑥 is used to generate the second one, reducing the number of parameters. In this work, we interpret this information as a fuzzy set, we generalize the standard operation using fuzzy negations, and we improve the accuracy obtained with the standard one.
Keywords
Fuzzy set complement, Fuzzy negations, Recurrent neural networks, Gated recurrent unit
1. Introduction
Recurrent Neural Networks (RNN) [1, 2] were introduced in the 1980s to deal with sequential data modelling, such as time series or text processing. Nevertheless, RNNs had a big problem in the training process, since the gradient value gradually tends to 0. This problem is known as the vanishing gradient problem [3]. With the objective of solving it, Long Short-Term Memories (LSTM) were introduced in 1997 by S. Hochreiter and J. Schmidhuber [4]. They are based on a gating mechanism. LSTM networks have had various modifications [5]. In 2014, Cho et al. improved this system [6], simplifying the gate-based unit architecture by using a single memory instead of separate long and short memories.
To simplify the gate system, the Gated Recurrent Unit (GRU) couples gates, using only one gate for modulating the information update of the cell. This means that, instead of using two gates 𝑔1 and 𝑔2 independently, they are coupled using the relation 𝑔2 = 1 − 𝑔1 [5], consequently reducing the number of parameters to learn.
In this work, we consider the update process of the GRU unit as a fuzzy set [7]. In this sense, for a concrete element, a membership value near 0 means that the element is not going to update
and a value near 1 means that it is going to update almost fully. Therefore, the 1 − 𝑥 operation can be understood as a negation, or the complement of the fuzzy set in question. In this way, we generalize the expression 1 − 𝑥 of the GRU equations by using fuzzy negations [8, 9, 10], generating the complementary fuzzy set from these negations. Experimentally, different fuzzy negations are considered, using both fixed expressions and values that are learned by the Gated Recurrent Unit itself. We test our results with a text classification dataset, and we show that our approach using different expressions [8] improves the performance of the GRU.
The structure of this work is as follows. In Section 2, the fuzzy set and GRU preliminaries are recalled. In Section 3, the GRU architecture modification is explained. In Section 4, the experimental framework and results are presented. Finally, some conclusions and future research lines are described in Section 5.
2. Preliminaries
In the present section, we present the definitions and constructions of fuzzy negations, and we also explain the main concepts of the GRU.
2.1. Fuzzy sets complementarity and fuzzy negations
From now on, we denote by 𝑋 a non-empty and finite universe.
Definition 2.1. [7] A fuzzy set 𝐴 on 𝑋 is given by 𝐴 = {(𝑥𝑖, 𝐴(𝑥𝑖)) | 𝑥𝑖 ∈ 𝑋} where, by abuse of notation, 𝐴 denotes a map 𝐴 : 𝑋 → [0, 1]. The value 𝐴(𝑥𝑖) is referred to as the membership degree of the element 𝑥𝑖 ∈ 𝑋 to the fuzzy set 𝐴.
Definition 2.2. [11] A function 𝑁 : [0, 1] → [0, 1] is called a fuzzy negation if (N1) 𝑁 (0) = 1
and 𝑁 (1) = 0 and (N2) is decreasing: if 𝑥 ≤ 𝑦 then 𝑁 (𝑥) ≥ 𝑁 (𝑦) for all 𝑥, 𝑦 ∈ [0, 1].
Definition 2.3. [11] A fuzzy negation N is called strict if (N3) is continuous and (N4) is strictly
decreasing, i.e. 𝑁 (𝑥) < 𝑁 (𝑦) when 𝑦 < 𝑥 for all 𝑥, 𝑦 ∈ [0, 1].
Definition 2.4. [11] A fuzzy negation 𝑁 is called strong if it is an involution, i.e., (N5) 𝑁 (𝑁 (𝑥)) =
𝑥 for all 𝑥 ∈ [0, 1].
Strong fuzzy negations are also strict fuzzy negations.
Example 2.5. (i) The standard strong fuzzy negation, known as the standard or Zadeh's negation, is defined as 𝑁𝑍(𝑥) = 1 − 𝑥.
(ii) [11] Other examples of fuzzy negations are shown in Table 1 and represented in Figure 1.
Definition 2.6. [8] A function 𝜙 : [0, 1] → [0, 1] is an automorphism on the interval [0, 1] if it is continuous, strictly increasing and satisfies the boundary conditions 𝜙(0) = 0 and 𝜙(1) = 1.
Theorem 2.7. [10] A function 𝑁 : [0, 1] → [0, 1] is a strong negation if and only if there exists
an automorphism 𝜙 : [0, 1] → [0, 1] such that: 𝑁 (𝑥) = 𝜙−1 (1 − 𝜙(𝑥)).
Table 1
Fuzzy negation examples and their properties

Name            Negation function                                Properties
—               𝑁𝐾(𝑥) = 1 − 𝑥²                                   (N1)–(N4). Strict.
—               𝑁𝑅(𝑥) = 1 − √𝑥                                   (N1)–(N4). Strict.
Sugeno class    𝑁𝜆(𝑥) = (1 − 𝑥)/(1 + 𝜆𝑥), 𝜆 > −1                 (N1)–(N5). Strong.
Yager class     𝑁^(𝜔)(𝑥) = (1 − 𝑥^𝜔)^(1/𝜔), 𝜔 > 0                (N1)–(N5). Strong.
Figure 1: Graphical representation of different negation examples (𝑁𝑍, 𝑁𝐾, 𝑁𝑅, 𝑁^(2) and 𝑁^(1/2)), with 𝑥 on the horizontal axis and 𝑁(𝑥) on the vertical axis.
The fuzzy negations constructed in this way (Theorem 2.7) are called 𝜙-transforms of the standard negation.
Example 2.8. (i) If we use 𝜙(𝑥) = 𝑥^𝜔 as automorphism (Theorem 2.7), we obtain the Yager class of negations (Table 1).
(ii) If we use 𝜙(𝑥) = 𝑥² (resp. 𝜙(𝑥) = √𝑥) as automorphism (Theorem 2.7), we obtain concrete examples of the Yager class of negations, 𝑁^(2)(𝑥) = √(1 − 𝑥²) (resp. 𝑁^(1/2)(𝑥) = (1 − √𝑥)²), which is the same as evaluating the Yager expression for 𝜔 = 2 and 𝜔 = 1/2, respectively.
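As an illustrative aside, the negations of Table 1 and the 𝜙-transform construction of Theorem 2.7 can be written in a few lines of NumPy. The following is only a minimal sketch with function names of our own choosing, not the implementation used in the experiments:

```python
import numpy as np

def n_standard(x):          # Zadeh's negation N_Z(x) = 1 - x
    return 1.0 - x

def n_k(x):                 # N_K(x) = 1 - x^2 (strict)
    return 1.0 - x ** 2

def n_r(x):                 # N_R(x) = 1 - sqrt(x) (strict)
    return 1.0 - np.sqrt(x)

def n_sugeno(x, lam):       # Sugeno class, lambda > -1 (strong)
    return (1.0 - x) / (1.0 + lam * x)

def n_yager(x, omega):      # Yager class, omega > 0 (strong)
    return (1.0 - x ** omega) ** (1.0 / omega)

def phi_transform(phi, phi_inv):
    """Strong negation N(x) = phi^{-1}(1 - phi(x)) built from an automorphism (Theorem 2.7)."""
    return lambda x: phi_inv(1.0 - phi(x))

# Example 2.8 (ii): phi(x) = x^2 yields the Yager negation with omega = 2.
n_circular = phi_transform(lambda x: x ** 2, np.sqrt)
x = np.linspace(0.0, 1.0, 5)
assert np.allclose(n_circular(x), n_yager(x, 2.0))
```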
Definition 2.9. The complement of a fuzzy set 𝐴 on 𝑋 with respect to a fuzzy negation 𝑁 is the
fuzzy set 𝐴𝑁 : 𝑋 → [0, 1] defined as 𝐴𝑁 = {(𝑥𝑖 , 𝑁 (𝐴(𝑥𝑖 ))) | 𝑥𝑖 ∈ 𝑋}
2.2. Gated Recurrent Unit (GRU)
In this subsection we explain the operation of the GRU [6]. Let 𝑘 be the dimension of the input sequence (x), 𝑛 the hidden size of the unit (h) and 𝑇 the number of timesteps. The input weight matrices are W𝑧𝑥, W𝑟𝑥, Wℎ𝑥 ∈ R𝑛×𝑘, the recurrent weight matrices are W𝑧ℎ, W𝑟ℎ, Wℎℎ ∈ R𝑛×𝑛 and the bias weight vectors are b𝑧, b𝑟, bℎ ∈ R𝑛. The description of the operations for each timestep 𝑡 ∈ {1, . . . , 𝑇 } is the following.
Figure 2: Graphical representation of the GRU (update gate z(𝑡), reset gate r(𝑡), candidate activation h̃(𝑡), output h(𝑡)).

The input values x(𝑡) ∈ R𝑘 and h(𝑡−1) ∈ R𝑛 enter the update (Eq. 1) and reset (Eq. 2) gates. In each of them, the value of x(𝑡) is multiplied by each of the input weight matrices (W𝑧𝑥, W𝑟𝑥). The same occurs with the values of h(𝑡−1) and the recurrent weight matrices (W𝑧ℎ, W𝑟ℎ). The 𝑛-dimensional vectors obtained from these multiplications are summed together with the corresponding bias vectors b𝑧, b𝑟. As activation function, the non-linear logistic sigmoid function is used coordinate-wise (𝜎 : R → [0, 1], where 𝜎(𝑥) = 1/(1 + 𝑒^(−𝑥))). The update vector (z(𝑡)) represents the selection of which part of the current state should be removed and which part should be retained. The reset vector (r(𝑡)) represents a weighting of which part of the previous step state is going to be used in the calculation of the candidate activation.
z(𝑡) = 𝜎(W𝑧𝑥 x(𝑡) + W𝑧ℎ h(𝑡−1) + b𝑧 ) (update gate) (1)
r(𝑡) = 𝜎(W𝑟𝑥 x(𝑡) + W𝑟ℎ h(𝑡−1) + b𝑟 ) (reset gate) (2)
For the calculation of the candidate activation (Eq. 3), the input value x(𝑡) is multiplied by the Wℎ𝑥 matrix. The input value h(𝑡−1) is weighted with r(𝑡) by multiplying element-wise, and the resultant vector is multiplied by the Wℎℎ matrix. As in the previous step, both 𝑛-dimensional structures are summed with bℎ. As activation function of the candidate activation, the hyperbolic tangent tanh : R → [−1, 1] is used coordinate-wise.
h̃(𝑡) = tanh(Wℎ𝑥 x(𝑡) + Wℎℎ (r(𝑡) ∘ h(𝑡−1)) + bℎ) (candidate activation) (3)
The previous timestep unit vector (h(𝑡−1)) and the candidate activation (h̃(𝑡)) are combined in this step. The Hadamard or element-wise product (∘) is calculated between h(𝑡−1) and the complement of the update gate with respect to 1 (1 − z(𝑡)), and between h̃(𝑡) and the update gate (z(𝑡)), respectively (Eq. 4). Both values are added, obtaining the current timestep value of the unit output vector, h(𝑡). The equation that describes the explained process is the following:

h(𝑡) = (1 − z(𝑡)) ∘ h(𝑡−1) + z(𝑡) ∘ h̃(𝑡) (output) (4)
3. GRU modification using fuzzy negations
Let 𝑛 be the hidden size of the GRU (Section 2). In the GRU learning process, the vector z(𝑡) represents the part of the current state that is going to be retained, and 1 − z(𝑡) represents the part of the previous time step memory that is forgotten. In this work, we generalize the latter, since the operation does not need to be an 𝑛-dimensional convex combination.
Being z(𝑡) = (𝑧1(𝑡), . . . , 𝑧𝑛(𝑡)) the update vector of the GRU, and having the non-empty finite universe 𝑋 = (𝑥1, . . . , 𝑥𝑛), we can interpret z(𝑡) as a fuzzy set 𝑍 on 𝑋, where each vector element 𝑧𝑖(𝑡) is the membership degree of the element 𝑥𝑖, hence 𝑍(𝑥𝑖) = 𝑧𝑖(𝑡) for all 𝑖 ∈ {1, . . . , 𝑛}, having the following fuzzy set:

𝑍 = {(𝑥𝑖, 𝑍(𝑥𝑖)) | 𝑥𝑖 ∈ 𝑋}
If the membership of an element 𝑥𝑖 ∈ 𝑋 to the fuzzy set is 0, this element is not updated (h(𝑡) = h(𝑡−1)), whereas if the membership is 1, it is going to have a full update (h(𝑡) = h̃(𝑡)). Between 0 and 1, the membership, and consequently the update measure, is modelled by the fuzzy set 𝑍, that is, the element updates partially, weighted by its membership to the fuzzy set.
We can obtain the complementary set of 𝑍 with respect to a fuzzy negation 𝑁 : [0, 1] → [0, 1] in the following way:

𝑍𝑁 = {(𝑥𝑖, 𝑁 (𝑍(𝑥𝑖))) | 𝑥𝑖 ∈ 𝑋} = {(𝑥𝑖, 𝑍𝑁 (𝑥𝑖)) | 𝑥𝑖 ∈ 𝑋}

Here, the standard negation 𝑁𝑍 is usually considered to calculate the complement, although we also use different expressions (Table 1 and Figure 1). As in the case of the construction of 𝑍, we can obtain a vector from 𝑍𝑁 as follows:

𝑧𝑖(𝑡)𝑁 = 𝑍𝑁 (𝑥𝑖) for all 𝑖 ∈ {1, . . . , 𝑛}

obtaining z(𝑡)𝑁 = (𝑧1(𝑡)𝑁, . . . , 𝑧𝑛(𝑡)𝑁 ). This way, we modify Equation 4 of the GRU, replacing 1 − z(𝑡) by the vector generated from the fuzzy complement with respect to a fuzzy negation (Equation 5):

h(𝑡) = z(𝑡)𝑁 ∘ h(𝑡−1) + z(𝑡) ∘ h̃(𝑡) (5)
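The modified output step of Eq. (5) can be sketched as follows, with the Yager class shown as one possible choice of 𝑁. The names, the clamping of z(𝑡) to [0, 1] and the softplus positivity constraint on 𝜔 are our own assumptions, not the authors' implementation:

```python
import torch

def yager_negation(z, omega):
    # N^(omega)(z) = (1 - z^omega)^(1/omega), omega > 0
    return (1.0 - z.clamp(0.0, 1.0) ** omega) ** (1.0 / omega)

def fuzzy_gru_output(z_t, h_prev, h_tilde, negation=lambda z: 1.0 - z):
    # Eq. (5): h^(t) = z^(t)N o h^(t-1) + z^(t) o h_tilde^(t),
    # where z^(t)N = N(z^(t)) is applied element-wise
    return negation(z_t) * h_prev + z_t * h_tilde

# The exponent omega can be a parameter learnt jointly with the GRU weights;
# a positivity constraint (here softplus) keeps omega > 0 during training.
omega = torch.nn.Parameter(torch.tensor(1.5))
yager = lambda z: yager_negation(z, torch.nn.functional.softplus(omega))
```

The Sugeno class can be plugged in the same way, with its parameter 𝜆 constrained to (−1, ∞).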
4. Experimental study
In the present section, on the one hand, we explain the experimental framework (the dataset, the neural network architecture, the training hyperparameters and the metrics) and, on the other hand, we present the obtained results.
4.1. Experimental framework
4.1.1. Dataset
Since the Gated Recurrent Unit improves on other recurrent models on small datasets, we have selected a small one. The dataset we use is the Text REtrieval Conference (TREC) dataset [12], a dataset for question classification. It contains 5500 questions in the training set and another 500 in the test set. The dataset is distributed into 6 classes.
Figure 3: Graphical representation of the used double stacked GRU architecture (for each timestep, x(𝑡) → Embedding → GRU → GRU → Linear → y(𝑡)).
4.1.2. Architecture
The used architecture (Figure 3) is separated into four layers (a minimal sketch follows the list):
• Embedding layer. It consists of an algorithm designed to reduce the input dimensionality to a fixed one (in this case, 50), encoding the input words as vectors. Words with close representations are more strongly related.
• Double stacked GRU layers. Two Gated Recurrent Units, each with a hidden size of 64.
• Linear fully connected layer. The output of the second GRU is fully connected, as in a multilayer perceptron, to a 6-node layer, giving a 6-dimensional class-score vector. We classify each input into the class whose position in this vector holds the maximum value.
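The following is a minimal sketch of this architecture. A standard torch.nn.GRU is used here as a stand-in; the proposal of Section 3 replaces the 1 − z(𝑡) complement inside the unit and therefore requires a custom cell such as the one sketched earlier. Class and variable names are ours:

```python
import torch.nn as nn

class TrecClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=50, hidden_size=64, num_classes=6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)      # word vectors of dimension 50
        self.gru = nn.GRU(embed_dim, hidden_size, num_layers=2,   # double stacked GRU, hidden size 64
                          batch_first=True)
        self.linear = nn.Linear(hidden_size, num_classes)         # fully connected 6-node layer

    def forward(self, tokens):                  # tokens: (batch, T) word indices
        embedded = self.embedding(tokens)       # (batch, T, 50)
        output, _ = self.gru(embedded)          # (batch, T, 64)
        logits = self.linear(output[:, -1, :])  # scores for the 6 TREC classes
        return logits                           # predicted class = argmax of the scores
```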
4.1.3. Training hyperparameters
In this experiment, for each negation function, 10 independent runs of 30 epochs each are performed. The optimization algorithm used is Adam [13], with a fixed learning rate 𝛾 = 1 × 10^(−3). The selected loss function is the Cross Entropy Loss.
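Under these settings, a training run could be sketched as follows (PyTorch; the data loader and model construction are assumed and not taken from the paper):

```python
import torch

def train_once(model, train_loader, epochs=30, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # Adam with fixed learning rate 1e-3
    criterion = torch.nn.CrossEntropyLoss()                  # cross entropy loss
    for _ in range(epochs):                                  # 30 epochs per run
        for tokens, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(tokens), labels)
            loss.backward()
            optimizer.step()
    return model

# One such run is repeated 10 times with independent initializations per negation.
```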
4.1.4. Metrics
Once the architecture has been presented, we explain the metrics used to evaluate the experimental results. For each experiment 𝑖, the metric to be used is the accuracy on the test set (𝑎𝑐𝑐), calculated as follows:

𝑎𝑐𝑐𝑖 = (Number of correct predictions) / (Total number of test examples) (6)
for 1 ≤ 𝑖 ≤ 10 (number of experiments). For the evaluation of the experiments for each negation function, the mean (Eq. 7) and the standard deviation (Eq. 8) of the accuracies of the 10 experiments are calculated as follows:

𝜇𝑎𝑐𝑐 = (1/10) ∑_{𝑖=1}^{10} 𝑎𝑐𝑐𝑖 (7)

𝑠𝑎𝑐𝑐 = √( (1/9) ∑_{𝑖=1}^{10} (𝑎𝑐𝑐𝑖 − 𝜇𝑎𝑐𝑐)² ) (8)
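A small sketch of these metrics (NumPy; the function names are ours):

```python
import numpy as np

def accuracy(predictions, labels):
    # Eq. (6): fraction of correct predictions on the test set
    return float(np.mean(np.asarray(predictions) == np.asarray(labels)))

def summarize(accs):
    # Eqs. (7)-(8): mean and sample standard deviation over the 10 runs
    accs = np.asarray(accs)
    return accs.mean(), accs.std(ddof=1)  # ddof=1 divides by n - 1 = 9
```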
4.2. Experimental results
The results are presented in Table 2. The table is divided into two parts, according to the negations used: in the first part, negations with fixed values are used, and in the second part, the values are learned by the GRU. For each fuzzy negation expression, 10 independent runs have been executed, and afterwards the mean accuracy (Eq. 7) and its standard deviation (Eq. 8) are computed (Table 2).
According to the first part of the table, we can see that the best accuracy value is obtained when the expression of the circular negation 𝑁^(2)(𝑥) = √(1 − 𝑥²) is used, gaining 1.68 points of average accuracy with respect to the standard negation. Better results than with 𝑁𝑍 are also obtained with 𝑁𝐾(𝑥) = 1 − 𝑥². Taking into account the first part of the table, we can summarize that the best results are obtained when we use a fuzzy negation 𝑁 fulfilling 𝑁(𝑥) > 1 − 𝑥 for all 𝑥 ∈ (0, 1) (Figure 1).
Regarding the second part of the table, we have used the Sugeno class and the Yager class fuzzy negation expressions, depending on the parameters 𝜆 ∈ (−1, ∞) and 𝜔 ∈ (0, ∞), respectively. These parameters are learnt by the recurrent neural network. As we can see in Table 2, both learnt expressions improve upon the ones selected with a fixed value. Concretely, the difference between the mean accuracies of the standard negation and the best learnt expression is 2.93 percentage points. This difference reflects the improvement obtained by using other expressions and, specifically, those learned by the neural network itself. Regarding the average values learned by the network, for the Yager expression we obtain 𝜔 = 1.417 for the first GRU and 𝜔 = 1.508 for the second one. The standard deviations are 0.016 and 0.028, respectively. For the Sugeno expression, we obtain 𝜆 = −0.265 and 𝜆 = −0.384, with standard deviations 0.014 and 0.023, respectively. In both cases, the standard deviations show that, over the 10 independent runs, the obtained values have had very small differences. These learned values also show that better results are obtained when we use a fuzzy negation 𝑁 such that 𝑁(𝑥) > 1 − 𝑥 for all 𝑥 ∈ (0, 1). Regarding an overall conclusion about the properties, we can also remark that the 3 best results are obtained using strong negations (Definition 2.4).
5. Conclusion
In this work we have interpreted a part of the GRU architecture as a fuzzy set, and we have proposed the use of different fuzzy negations to compute its complement. We have observed that better results are obtained using fuzzy negations 𝑁 for which 𝑁(𝑥) > 𝑁𝑍(𝑥) for all 𝑥 ∈ (0, 1).
Regarding future lines of research, on the theoretical side our intention is to continue investigating new ways to generalize and interpret recurrent neural network operators, such as using 𝑛-dimensional fuzzy sets or extending the concept of fuzzy negation. On the applied side, future lines involve modifying other architectures, as well as applying these architectures to other specific problems, such as language modelling.
Table 2
Mean accuracy and standard deviation using different fuzzy negations

Name of Fuzzy Negation    Accuracy (𝜇𝑎𝑐𝑐 ± 𝑠𝑎𝑐𝑐)
𝑁𝑍                        80.64 ± 3.12
𝑁𝐾                        81.70 ± 1.73
𝑁𝑅                        80.82 ± 1.26
𝑁^(2)                     82.32 ± 1.73
𝑁^(1/2)                   73.74 ± 5.11
Best Sugeno class         83.28 ± 1.57
Best Yager class          83.57 ± 1.05
(Box plot of the test accuracies obtained with 𝑁𝑍, 𝑁𝐾, 𝑁𝑅, 𝑁^(2), 𝑁^(1/2), the Yager class and the Sugeno class negations.)
Acknowledgments
Grant PID2019-108392GB-I00 funded by MCIN/AEI/10.13039/501100011033 and by Tracasa
Instrumental and the Immigration Policy and Justice Department of the Government of Navarre.
References
[1] D. Rumelhart, G. E. Hinton, R. J. Williams, Learning representations by back-propagating
errors, Nature 323 (1986) 533–536.
[2] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks, Studies in
computational intelligence, Springer, Berlin, 2012. URL: https://cds.cern.ch/record/1503877.
doi:10.1007/978-3-642-24797-2.
[3] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, Gradient flow in recurrent nets: the
difficulty of learning long-term dependencies, in: S. C. Kremer, J. F. Kolen (Eds.), A Field
Guide to Dynamical Recurrent Neural Networks, IEEE Press, 2001.
[4] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (1997)
1735–1780.
[5] J. van der Westhuizen, J. Lasenby, The unreasonable effectiveness of the forget gate, CoRR
abs/1804.04849 (2018). URL: http://arxiv.org/abs/1804.04849. arXiv:1804.04849.
[6] K. Cho, B. V. Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, 2014. arXiv:1406.1078.
[7] L. Zadeh, Fuzzy sets, Information and Control 8 (1965) 338–353. doi:10.1016/S0019-9958(65)90241-X.
[8] H. Bustince, P. Burillo, F. Soria, Automorphisms, negations and implication operators, Fuzzy Sets and Systems 134 (2003) 209–229. doi:10.1016/S0165-0114(02)00214-2.
[9] H. Zapata, H. Bustince, L. D. Miguel, C. Guerra, Some properties of implications via aggregation functions and overlap functions, International Journal of Computational Intelligence Systems 7 (2014) 993–1001. doi:10.1080/18756891.2014.967005.
[10] E. Trillas, Sobre funciones de negación en la teoría de conjuntos difusos., Stochastica 3
(1979) 47–60. URL: http://eudml.org/doc/38807.
[11] M. Baczyński, B. Jayaram, Fuzzy implications, in: Studies in Fuzziness and Soft Computing,
2008.
[12] X. Li, D. Roth, Learning question classifiers, in: Proceedings of the 19th International
Conference on Computational Linguistics - Volume 1, COLING ’02, Association for Com-
putational Linguistics, USA, 2002, p. 1–7. URL: https://doi.org/10.3115/1072228.1072378.
doi:10.3115/1072228.1072378.
[13] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, CoRR abs/1412.6980
(2015).