=Paper= {{Paper |id=Vol-2172/p7-labda_tass2018 |storemode=property |title=LABDA at TASS-2018 Task 3: Convolutional Neural Networks for Relation Classification in Spanish eHealth documents |pdfUrl=https://ceur-ws.org/Vol-2172/p7-labda_tass2018.pdf |volume=Vol-2172 |authors=Víctor Suárez-Paniagua,Isabel Segura-Bedmar,Paloma Martínez |dblpUrl=https://dblp.org/rec/conf/sepln/Suarez-Paniagua18 }} ==LABDA at TASS-2018 Task 3: Convolutional Neural Networks for Relation Classification in Spanish eHealth documents== https://ceur-ws.org/Vol-2172/p7-labda_tass2018.pdf
                    TASS 2018: Workshop on Semantic Analysis at SEPLN, septiembre 2018, págs. 71-76




 LABDA at TASS-2018 Task 3: Convolutional Neural
   Networks for Relation Classification in Spanish
                eHealth documents
          LABDA en TASS-2018 Task 3: Redes Neuronales
         Convolucionales para Clasificación de Relaciones en
            documentos electrónicos de salud en español
       Víctor Suárez-Paniagua, Isabel Segura-Bedmar, Paloma Martínez
                              Computer Science Department
                             Carlos III University of Madrid
                              Leganés 28911, Madrid, Spain
                       {vspaniag, isegura, pmf}@inf.uc3m.es

       Resumen: Este trabajo presenta la participación del equipo LABDA en la subtarea de clasificación de relaciones entre dos entidades identificadas en documentos electrónicos de salud (eHealth) escritos en español. Usamos una Red Neuronal Convolucional con el word embedding y el position embedding de cada palabra para clasificar el tipo de la relación entre dos entidades de la oración. Anteriormente, este método de aprendizaje automático ya ha mostrado buen rendimiento para capturar las características relevantes en documentos electrónicos de salud que describen relaciones. Nuestra arquitectura obtuvo una F1 de 44.44 % en el escenario 3 de la tarea, denominado Setting semantic relationships. Solo cinco equipos presentaron resultados para la subtarea. Nuestro sistema alcanzó el segundo F1 más alto, muy similar al mejor resultado (micro F1 = 44.8 %) y superior al del resto de los equipos. Una de las principales ventajas de nuestra aproximación es que no requiere ningún recurso de conocimiento externo como características.
       Palabras clave: Extracción de relaciones, aprendizaje profundo, redes neuronales convolucionales, textos biomédicos
       Abstract: This work presents the participation of the LABDA team in the subtask of classifying relationships between two identified entities in electronic health (eHealth) documents written in Spanish. We used a Convolutional Neural Network (CNN) with the word embedding and the position embedding of each word to classify the type of relation between two entities in a sentence. This machine learning method has previously shown good performance at capturing the relevant features of electronic health documents that describe relationships. Our architecture obtained an F1 of 44.44 % in scenario 3 of the shared task, named Setting semantic relationships. Only five teams submitted results for this subtask. Our system achieved the second highest F1, very similar to the top score (micro F1 = 44.8 %) and higher than the remaining teams. One of the main advantages of our approach is that it does not require any external knowledge resource as features.
       Keywords: Relation Classification, Deep Learning, Convolutional Neural Network, biomedical texts
ISSN 1613-0073. Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes.

1 Introduction

Nowadays, there is a large increase in the publication of scientific articles every year, which demonstrates that we are living in an emerging knowledge era. This explosion of information makes it nearly impossible for doctors and biomedical researchers to keep up to date with the literature in their fields. The development of automatic systems to extract and analyse information from electronic health (eHealth) documents can significantly reduce the workload of doctors.

The TASS workshop proposes shared tasks on sentiment analysis in Spanish each year. Concretely, the goal of TASS-2018 Task 3 (Martínez-Cámara et al., 2018) is to create a competition where Natural Language Processing (NLP) experts can train their systems for extracting the relevant information from Spanish eHealth documents and evaluate them in an objective and fair way.

Recently, Deep Learning has had a big impact on NLP tasks, becoming the state-of-the-art technique. The Convolutional Neural Network (CNN) is a Deep Learning architecture which has shown good performance in Computer Vision tasks such as image classification (Krizhevsky, Sutskever, y Hinton, 2012) and face recognition (Lawrence et al., 1997).

The system described in (Kim, 2014) was the first work to use a CNN for an NLP task. It created a vector representation of each sentence by extracting the relevant information with different filters in order to classify sentences into predefined categories, obtaining good results. In addition, a CNN achieved good performance for relation classification between nominals in the work of (Zeng et al., 2014). Furthermore, this architecture has also been used in the biomedical domain for the extraction of drug-drug interactions in (Suárez-Paniagua, Segura-Bedmar, y Martínez, 2017a). This system did not require any external biomedical knowledge to obtain results very close to those achieved with many hand-crafted features. We also employed the same approach as (Suárez-Paniagua, Segura-Bedmar, y Martínez, 2017b), which was used for extracting relationships between keyphrases in SemEval-2017 Task 10: ScienceIE (Augenstein et al., 2017), a task that proposed subtasks very similar to those defined in TASS-2018 Task 3.

In this work, we describe the participation of the LABDA team in subtask C, the classification of relationships between two identified entities in Spanish documents about health. In this subtask, the test dataset includes the text and the boundaries and types of its entities, from which the prediction is generated.

2 Dataset

The task provides an annotated corpus of MedlinePlus documents, which is divided into a training set for the learning step, a development set for validation and a test set for the evaluation of the systems.

The relationships between entities defined as concepts are: is-a, part-of, property-of and same-as. There are also relationships defined as roles: subject and target. The training set contains 559 sentences with 3,276 entities, 1,012 relations and 1,385 roles; the development set contains another 285 sentences. A detailed description of the method used to collect and process the documents can be found in (Martínez-Cámara et al., 2018).

Unlike the two previous subtasks, the documents include annotated entities with their boundaries and types. In this way, it is possible to measure and compare the different approaches focusing only on the goal of subtask C.

2.1 Pre-processing phase

As some of the relationship types are asymmetrical, for each pair of entities marked in the sentence we generate two instances. Thus, a sentence with n entities yields n × (n − 1) instances. Each instance is labelled with one of the six classes is-a, part-of, property-of, same-as, subject and target. In addition, a None class is also considered for the absence of a relationship between the entities. Because some entities overlap, we consider each sentence as a graph whose vertices are the entities and whose edges connect non-overlapping entities, and we recursively obtain all the possible paths without overlaps; thus, we create different instances for each set of overlapping entities. Table 2 shows the resulting number of instances for each class on the train, validation and test sets.

    Label          Train   Validation   Test
    is-a             238          299     41
    part-of          222          171     36
    property-of      600          366     84
    same-as           42           19      8
    subject         1018          636    206
    target          1510          988    308
    None           27112        20631   5265

Tabla 2: Number of instances for each relationship type in each dataset: train, validation and test.

After that, we tokenize and clean the sentences following an approach similar to that described in (Kim, 2014): converting numbers to a common token, lower-casing words, replacing accented Spanish characters with their unaccented equivalents, e.g. ñ to n, and separating special characters with white spaces by regular
    Relationship between entities       Instance after entity blinding                                    Label
    (ataque de asma → produce)          'un entity1 se entity2 cuando los entity0 entity0 .'              None
    (ataque de asma ← produce)          'un entity2 se entity1 cuando los entity0 entity0 .'              target
    (ataque de asma → síntomas)         'un entity1 se entity0 cuando los entity2 entity0 .'              None
    (ataque de asma ← síntomas)         'un entity2 se entity0 cuando los entity1 entity0 .'              None
    (ataque de asma → empeoran)         'un entity1 se entity0 cuando los entity0 entity2 .'              None
    (ataque de asma ← empeoran)         'un entity2 se entity0 cuando los entity0 entity1 .'              None
    (produce → síntomas)                'un entity0 se entity1 cuando los entity2 entity0 .'              None
    (produce ← síntomas)                'un entity0 se entity2 cuando los entity1 entity0 .'              None
    (produce → empeoran)                'un entity0 se entity1 cuando los entity0 entity2 .'              subject
    (produce ← empeoran)                'un entity0 se entity2 cuando los entity0 entity1 .'              None
    (síntomas → empeoran)               'un entity0 se entity0 cuando los entity1 entity2 .'              None
    (síntomas ← empeoran)               'un entity0 se entity0 cuando los entity2 entity1 .'              target
    (asma → produce)                    'un ataque de entity1 se entity2 cuando los entity0 entity0 .'    None
    (asma ← produce)                    'un ataque de entity2 se entity1 cuando los entity0 entity0 .'    None
    (asma → síntomas)                   'un ataque de entity1 se entity0 cuando los entity2 entity0 .'    None
    (asma ← síntomas)                   'un ataque de entity2 se entity0 cuando los entity1 entity0 .'    None
    (asma → empeoran)                   'un ataque de entity1 se entity0 cuando los entity0 entity2 .'    None
    (asma ← empeoran)                   'un ataque de entity2 se entity0 cuando los entity0 entity1 .'    None
    (produce → síntomas)                'un ataque de entity0 se entity1 cuando los entity2 entity0 .'    None
    (produce ← síntomas)                'un ataque de entity0 se entity2 cuando los entity1 entity0 .'    None
    (produce → empeoran)                'un ataque de entity0 se entity1 cuando los entity0 entity2 .'    subject
    (produce ← empeoran)                'un ataque de entity0 se entity2 cuando los entity0 entity1 .'    None
    (síntomas → empeoran)               'un ataque de entity0 se entity0 cuando los entity1 entity2 .'    None
    (síntomas ← empeoran)               'un ataque de entity0 se entity0 cuando los entity2 entity1 .'    target

Tabla 1: Instances of relationships between two different entities after the pre-processing phase with entity blinding of the sentence 'Un ataque de asma se produce cuando los síntomas empeoran.'.
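The entity-blinding transformation shown in Table 1 can be sketched in a few lines. This is a simplified sketch with single-token entities and helper names of our own choosing; multi-word entities such as 'ataque de asma' would need span handling instead of single indices.

```python
# Sketch of entity blinding: the two candidate entities become "entity1"/"entity2",
# every other annotated entity becomes "entity0".
def blind(tokens, entities, e1, e2):
    """tokens: list of words; entities: dict name -> token index;
    e1, e2: the two candidate entities of a directed pair."""
    out = list(tokens)
    for name, idx in entities.items():
        if name == e1:
            out[idx] = "entity1"
        elif name == e2:
            out[idx] = "entity2"
        else:
            out[idx] = "entity0"
    return " ".join(out)

tokens = ["un", "ataque", "se", "produce", "cuando", "los", "sintomas", "empeoran", "."]
entities = {"ataque": 1, "produce": 3, "sintomas": 6, "empeoran": 7}

# Directed pair (produce -> empeoran), labelled subject in Table 1:
print(blind(tokens, entities, "produce", "empeoran"))
# -> un entity0 se entity1 cuando los entity0 entity2 .

# n entities yield n * (n - 1) ordered pairs, i.e. instances:
pairs = [(a, b) for a in entities for b in entities if a != b]
print(len(pairs))  # 12 = 4 * 3
```

Generating both directions of every pair is what makes the asymmetric labels (subject vs. target) learnable from blinded text alone.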

expressions.

Furthermore, the two target entities of each instance are replaced by the labels "entity1" and "entity2", and the remaining entities by "entity0". This method is known as entity blinding, and it supports the generalization of the model. For instance, the sentence in Figure 1, 'Un ataque de asma se produce cuando los síntomas empeoran.', with the entities ataque de asma, asma, produce, síntomas and empeoran, is transformed into the relation instances shown in Table 1.

Figura 1: Relationships and entities in the sentence 'Un ataque de asma se produce cuando los síntomas empeoran.'.

We observed that some instances involve relationships between an entity and an entity overlapping it; we remove these from the dataset because they cannot be handled by the entity blinding process. Moreover, some relationships have more than one label; in this case, we keep just one label because our system cannot cope with a multi-label problem.

3 CNN model

In this section, we present the CNN architecture used for the task of relation extraction in electronic health documents. Figure 2 shows the entire process of the CNN, from a sentence with marked entities to the returned prediction.

3.1 Word table layer

After the pre-processing phase, we created an input matrix suitable for the CNN architecture. The input matrix should represent all training instances for the CNN model; therefore, they should all have the same length. We determined the maximum sentence length over all the instances (denoted by n), and then extended sentences shorter than n by padding them with an auxiliary token "0".

Moreover, each word has to be represented by a vector. To do this, we randomly initialized a vector for each distinct word, which allows us to replace each word by its word embedding vector: We ∈ R^(|V|×me), where V is the vocabulary size and me is the word embedding dimension. Finally, we obtained a vector x = [x1; x2; ...; xn] for each instance, where each word of the sentence is represented by its corresponding word vector from the word embedding matrix. We denote p1 and p2 as the positions in the sentence of the two




[Figure 2 shows the full pipeline for the sentence 'Un ataque de asma<\e1> se produce<\e2> cuando los síntomas<\e0> empeoran<\e0>.': after preprocessing, a look-up table layer builds X from the word embeddings We (|V| × me) and the position embeddings Wd1 and Wd2 ((2n−1) × md); a convolutional layer with windows of size w produces S; a pooling layer produces z; and a softmax layer with dropout and weights Ws produces the output o.]
  Figura 2: CNN model for the Setting semantic relationships subtask of TASS-2018-Task 3.
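As a shape-level sanity check, the forward pass depicted in Figure 2 can be sketched with NumPy using the dimensions given in Section 3 and Table 4. This is our illustrative reconstruction with random weights, not the authors' code; all variable names are ours, and k = 7 assumes the 6 relation classes plus None.

```python
import numpy as np

rng = np.random.default_rng(0)
n, me, md = 38, 300, 10        # max sentence length, word/position embedding sizes
V, w, m, k = 5000, 3, 200, 7   # vocab size, filter width, filters, classes (6 + None)

We  = rng.normal(size=(V, me))          # word embedding table
Wd1 = rng.normal(size=(2 * n - 1, md))  # position embeddings w.r.t. entity 1
Wd2 = rng.normal(size=(2 * n - 1, md))  # position embeddings w.r.t. entity 2

ids = rng.integers(0, V, size=n)        # a padded token-id sequence
p1, p2 = 3, 7                           # positions of the two candidate entities
pos = np.arange(n)
d1 = (pos - p1) + (n - 1)               # shift (-n+1, n-1) to non-negative indices
d2 = (pos - p2) + (n - 1)

# Look-up table layer: X in R^{n x (me + 2*md)}
X = np.concatenate([We[ids], Wd1[d1], Wd2[d2]], axis=1)

# Convolutional layer: s_i = g(sum_j f_j . x_{i+j-1} + b), with g = ReLU
F = rng.normal(size=(m, w, me + 2 * md))
b = np.zeros(m)
S = np.stack([np.maximum(0.0, np.einsum('mwd,wd->m', F, X[t:t + w]) + b)
              for t in range(n - w + 1)])      # (n - w + 1, m)

z = S.max(axis=0)                       # max pooling over positions -> (m,)
Ws = rng.normal(size=(m, k))
o = z @ Ws                              # softmax scores (dropout omitted at test time)
probs = np.exp(o - o.max())
probs /= probs.sum()
print(X.shape, S.shape, z.shape, probs.shape)
```

Filters of sizes 4 and 5 (Table 4) would be handled the same way, concatenating their pooled outputs before the softmax layer.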

entities to be classified.

The following step involves calculating the relative position of each word with respect to the two candidate entities as i − p1 and i − p2, where i is the word position in the sentence (padded words included), in the same way as (Zeng et al., 2014). In order to avoid negative values, we transformed the range (−n + 1, n − 1) into the range (1, 2n − 1). Then, we mapped these distances into real-valued vectors using two position embeddings Wd1 ∈ R^((2n−1)×md) and Wd2 ∈ R^((2n−1)×md). Finally, we created an input matrix X ∈ R^(n×(me+2md)), which is the concatenation of the word embedding and the two position embeddings of each word in the instance.

3.2 Convolutional layer

Once we obtained the input matrix, we applied a filter matrix f = [f1; f2; ...; fw] ∈ R^(w×(me+2md)) over a context window of size w in the convolutional layer to create higher-level features. For each filter, we obtained a score sequence s = [s1; s2; ...; s_{n−w+1}] ∈ R^((n−w+1)×1) for the whole sentence as

    s_i = g( Σ_{j=1}^{w} f_j x^T_{i+j−1} + b )

where b is a bias term and g is a non-linear function (such as the hyperbolic tangent or the sigmoid). Note that in Figure 2 we represent the total number of filters, denoted by m, all with the same size w, as a matrix S ∈ R^((n−w+1)×m). However, the same process can be applied to filters of different sizes by creating additional matrices, which would be concatenated in the following layer.

3.3 Pooling layer

In this layer, the goal is to extract the most relevant feature of each filter using an aggregating function. We used the max function, which produces a single value for each filter as z_f = max{s} = max{s1; s2; ...; s_{n−w+1}}. Thus, we created a vector z = [z1, z2, ..., zm], whose dimension is the total number of filters m, representing the relation instance. If there are filters of different sizes, their output values are concatenated in this layer.

3.4 Softmax layer

Prior to performing the classification, we applied dropout to prevent overfitting. We obtained a reduced vector zd by randomly setting elements of z to zero with probability p following a Bernoulli distribution. After that, we fed this vector into a fully connected softmax layer with weights Ws ∈ R^(m×k) to compute the output prediction values for the classification as o = zd·Ws + d, where d is a bias term; we have k = 6 classes in the



            Label              Correct       Missing      Spurious        Precision       Recall       F1
            is-a               8             61           8               50 %            11.59 %      18.82 %
            part-of            5             27           5               50 %            15.63 %      23.81 %
            property-of        9             53           12              42.86 %         14.52 %      21.69 %
            same-as            1             4            0               100 %           20 %         33.33 %
            subject            50            87           37              57.47 %         36.5 %       44.64 %
            target             113           99           72              61.08 %         53.3 %       56.93 %
            Scenario 3         186           331          134             58.12 %         35.98 %      44.44 %

             Tabla 3: Results over the test set using a CNN with position embedding.
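The micro-averaged Scenario 3 row of Table 3 can be reproduced from its Correct, Missing and Spurious counts with the formulas of Section 4; the counts and formulas are from the paper, while the helper function is ours.

```python
# Precision, recall and F1 from Correct (C), Missing (M) and Spurious (S):
# P = C/(C+S), R = C/(C+M), F1 = 2PR/(P+R), as defined in Section 4.
def prf(C, M, S):
    P = C / (C + S)
    R = C / (C + M)
    return P, R, 2 * P * R / (P + R)

P, R, F1 = prf(C=186, M=331, S=134)   # Scenario 3 totals from Table 3
print(f"P={P:.4f} R={R:.4f} F1={F1:.4f}")  # F1 rounds to 44.44 %, matching the table
```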

dataset and the "None" class. At test time, the vector z of a new instance is classified directly by the softmax layer without dropout.

3.5 Learning

For the training phase, we need to learn the CNN parameter set θ = (We, Wd1, Wd2, Ws, d, Fm, b), where Fm denotes all of the m filters f. For this purpose, we used the conditional probability of a relation r obtained by the softmax operation,

    p(r|x, θ) = exp(o_r) / Σ_{l=1}^{k} exp(o_l)

and minimized the cross-entropy function over all instances (x_i, y_i) in the training set T:

    J(θ) = − Σ_{i=1}^{|T|} log p(y_i|x_i, θ)

In addition, we minimize the objective function by using stochastic gradient descent over shuffled mini-batches with the Adam update rule (Kingma y Ba, 2014) to learn the parameters.

4 Results and Discussion

The CNN model was trained on the training set, and we obtained the best value of each parameter by fine-tuning on the validation set (see Table 4).

    Parameter                                  Value
    Maximum sentence length in the dataset, n  38
    Word embedding dimension, me               300
    Position embedding dimension, md           10
    Filter window sizes, w                     3, 4, 5
    Filters for each window size, m            200
    Dropout rate, p                            0.5
    Non-linear function, g                     ReLU
    Mini-batch size                            50
    Learning rate                              0.001

Tabla 4: The CNN model parameters and the values used for the results.

The results were measured with precision (P), recall (R) and F1, defined as:

    P = C / (C + S),   R = C / (C + M),   F1 = 2 (P × R) / (P + R)

where Correct (C) counts the relations present in both the test set and the prediction, Missing (M) the relations in the test set but not in the prediction, and Spurious (S) the relations in the prediction but not in the test set.

Table 3 shows the results of the CNN configuration with position embeddings. We observe that the number of Missing relations is very high. This may be due to the fact that the dataset is very unbalanced, so these instances are classified as None by the system. In fact, the better-represented classes obtain better Recall. To solve this problem, we propose to use sampling techniques to increase the number of instances of the less represented classes.

Only five teams submitted results for this subtask. Our system achieved the second highest F1, very similar to the top score (micro F1 = 44.8 %) and much higher than the other teams, which scored below 11 % F1. One of the main advantages of our approach is that it does not require any external knowledge resource.

5 Conclusions and Future work

In this paper, we propose a CNN model for subtask C (Setting semantic relationships) of TASS-2018 Task 3. The official results for this model show that the CNN is a very promising system because neither expert domain knowledge nor external features are needed. The configuration of the architecture is very simple, with a basic preprocessing



adapted for Spanish documents.

The results show that the system produces very many false negatives. We think that this may be due to the unbalanced nature of the dataset. To solve this problem, we propose to use oversampling techniques to increase the number of instances of the less represented classes. Our system also seems to have difficulties in distinguishing the directionality of the relationships. For these reasons, we will use more complex settings of the architecture to tackle the directionality problem.

Moreover, we plan to use external features as part of the embeddings, such as the entity labels given by the second subtask, the Part-of-Speech (PoS) tags and the dependency type of each word in the Spanish documents, in order to increase the information of each sentence. We want to explore the contribution of each feature in detail and fine-tune all the parameters. Furthermore, we will use some rules to distinguish the relations and the roles with the entity labels and train two different classifiers, so that they would be more accurate. In addition, we will explore other neural network architectures, such as the Recurrent Neural Network, and possible combinations with the CNN.

Funding

This work was supported by the Research Program of the Ministry of Economy and Competitiveness - Government of Spain (DeepEMR project TIN2017-87548-C2-1-R).

Bibliografía

Augenstein, I., M. Das, S. Riedel, L. Vikraman, y A. McCallum. 2017. SemEval 2017 Task 10: ScienceIE - extracting keyphrases and relations from scientific publications. En Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), páginas 546–555, Vancouver, Canada, August. Association for Computational Linguistics.

Kim, Y. 2014. Convolutional neural networks for sentence classification. En Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), páginas 1746–1751.

Kingma, D. P. y J. Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Krizhevsky, A., I. Sutskever, y G. E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. En Advances in Neural Information Processing Systems 25, páginas 1097–1105. Curran Associates, Inc.

Lawrence, S., C. L. Giles, A. C. Tsoi, y A. D. Back. 1997. Face recognition: a convolutional neural-network approach. IEEE Transactions on Neural Networks, 8(1):98–113, Jan.

Martínez-Cámara, E., Y. Almeida-Cruz, M. C. Díaz-Galiano, S. Estévez-Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez, A. Montejo-Ráez, A. Montoyo, R. Muñoz, A. Piad-Morffis, y J. Villena-Román. 2018. Overview of TASS 2018: Opinions, health and emotions. En E. Martínez-Cámara, Y. Almeida-Cruz, M. C. Díaz-Galiano, S. Estévez-Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez, A. Montejo-Ráez, A. Montoyo, R. Muñoz, A. Piad-Morffis, y J. Villena-Román, editores, Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volumen 2172 de CEUR Workshop Proceedings, Sevilla, Spain, September. CEUR-WS.

Suárez-Paniagua, V., I. Segura-Bedmar, y P. Martínez. 2017a. Exploring convolutional neural networks for drug-drug interaction extraction. Database, 2017:bax019.

Suárez-Paniagua, V., I. Segura-Bedmar, y P. Martínez. 2017b. LABDA at SemEval-2017 Task 10: Relation classification between keyphrases via convolutional neural network. En Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval@ACL 2017, Vancouver, Canada, August 3-4, 2017, páginas 969–972.

Zeng, D., K. Liu, S. Lai, G. Zhou, y J. Zhao. 2014. Relation classification via convolutional deep neural network. En Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), Technical Papers, páginas 2335–2344, Dublin, Ireland, August. Dublin City University and Association for Computational Linguistics.