<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Drug-drug interaction extraction from biomedical texts using long short-term memory network. Journal of Biomedical Informatics</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1532-0464</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1093/bioinformatics/btz682</article-id>
      <title-group>
        <article-title>BERTKG-DDI: Towards Incorporating Entity-specific Knowledge Graph Information in Predicting Drug-Drug Interactions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ishani Mondal</string-name>
          <aff>Microsoft Research Lab, Lavelle Road, Bengaluru, India</aff>
          <email>ishani@gmail.com</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <volume>86</volume>
      <issue>15</issue>
      <fpage>1234</fpage>
      <lpage>1240</lpage>
      <abstract>
        <p>Off-the-shelf biomedical embeddings obtained from recently released pre-trained language models (such as BERT and XLNet) have demonstrated state-of-the-art accuracy on various natural language understanding (NLU) tasks in the biomedical domain. Relation Classification (RC) is one of the most critical of these tasks. In this paper, we explore how to incorporate domain knowledge about biomedical entities (such as drugs, diseases, and genes), obtained from Knowledge Graph (KG) embeddings, for predicting Drug-Drug Interactions (DDIs) from a textual corpus. We propose a new method, BERTKG-DDI, which combines drug embeddings learned from each drug's interactions with other biomedical entities with a domain-specific BioBERT embedding-based RC architecture. Experiments conducted on the DDIExtraction 2013 corpus indicate that this strategy improves over other baseline architectures by 4.1% macro F1-score.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>When multiple drugs are administered concurrently to a patient, the combination may cure an ailment or it may lead to serious side-effects. These types of interactions are known as Drug-Drug Interactions (DDIs). Predicting DDIs is a difficult task because it requires understanding the underlying mechanism of action of the interacting drugs. Researchers have recently made numerous efforts towards the automatic extraction of DDIs from textual corpora (Sahu and Anand 2018), (Liu et al. 2016), (Sun et al. 2019), (Li and Ji 2019), (Mondal 2020) and towards predicting unknown DDIs from KGs (Purkayastha et al. 2019). Automatic extraction of DDIs from text helps to maintain large-scale databases and thereby assists medical experts in their diagnosis.</p>
      <p>In parallel with the progress of DDI extraction from textual corpora, researchers have recently proposed various strategies for augmenting models with the chemical structure information and textual descriptions of the drugs (Zhu et al. 2020) to improve Drug-Drug Interaction prediction performance from corpora and Knowledge Graphs. Earlier work has framed DDI prediction from a textual corpus as a relation classification problem (Sahu and Anand 2018), (Liu et al. 2016), (Sun et al. 2019), (Li and Ji 2019), using CNN- or RNN-based neural networks.</p>
      <p>
        Recently, following the massive success of pre-trained language models
        <xref ref-type="bibr" rid="ref4">(Devlin et al. 2019)</xref>
        , (Yang et al. 2019) in many NLP classification tasks, we formulate the problem of DDI classification as a relation classification task that leverages both the entities and their contextual information. We propose a model that leverages both domain-specific contextual embeddings (BioBERT) (Lee et al. 2019) of the target entities (drugs) and external information about them. In recent years, representation learning has played a pivotal role in solving various machine learning tasks.
      </p>
      <p>In this work, we explore the direction of augmenting the model with graph embeddings to predict the relation between two drugs from a textual corpus. We use an in-house Knowledge Graph (Bio-KG), built by curating the interactions among drugs, diseases, and genes from multiple ontologies. To capture the complex underlying mechanism of interactions among the biomedical entities, we learn translation-based and semantics-preserving heterogeneous graph embeddings on Bio-KG and use the resulting entity representations jointly with the text to train the relation classification model. Experiments conducted on the DDIExtraction 2013 corpus (Herrero-Zazo et al. 2013) reveal that this method outperforms the existing baseline models and is in line with the new research direction of fusing various sources of information for DDI prediction. In a nutshell, the major contributions of this work are as follows:</p>
      <list list-type="order">
        <list-item><p>We propose a novel method that jointly leverages textual and external Knowledge information to classify the relation type between the drug pairs mentioned in the text, showing the efficacy of external entity-specific information.</p></list-item>
        <list-item><p>Our method achieves new state-of-the-art performance on the DDIExtraction 2013 corpus.</p></list-item>
      </list>
    </sec>
    <sec id="sec-2">
      <title>Problem Statement</title>
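      <p>Given an input instance or sentence s with two target drug entities <inline-formula><tex-math>d_1</tex-math></inline-formula> and <inline-formula><tex-math>d_2</tex-math></inline-formula>, the task is to classify the type of relation y that holds between the drugs, where <inline-formula><tex-math>y \in \{y_1, \ldots, y_N\}</tex-math></inline-formula> and N denotes the number of relation types.</p>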
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <sec id="sec-3-1">
        <title>Text-based Relation Classification</title>
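        <p>Our model for extracting DDIs from text is based on the pre-trained BERT-based relation classification model of (Wu and He 2019). Given a sentence s with drugs <inline-formula><tex-math>d_1</tex-math></inline-formula> and <inline-formula><tex-math>d_2</tex-math></inline-formula>, let H be the final hidden state output of the BERT module. Let <inline-formula><tex-math>H_i</tex-math></inline-formula> to <inline-formula><tex-math>H_j</tex-math></inline-formula> be the final hidden state vectors from BERT for entity <inline-formula><tex-math>d_1</tex-math></inline-formula>, and <inline-formula><tex-math>H_k</tex-math></inline-formula> to <inline-formula><tex-math>H_m</tex-math></inline-formula> those for entity <inline-formula><tex-math>d_2</tex-math></inline-formula>. An average operation is applied to obtain a vector representation for each drug entity. A tanh activation followed by a fully connected layer is then applied to each of the two vectors, and the outputs for <inline-formula><tex-math>d_1</tex-math></inline-formula> and <inline-formula><tex-math>d_2</tex-math></inline-formula> are <inline-formula><tex-math>H'_1</tex-math></inline-formula> and <inline-formula><tex-math>H'_2</tex-math></inline-formula>, respectively.</p>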
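        <disp-formula><label>(1)</label><tex-math>H'_1 = W\left[\tanh\left(\frac{1}{j - i + 1}\sum_{t=i}^{j} H_t\right)\right] + b</tex-math></disp-formula>
        <disp-formula><label>(2)</label><tex-math>H'_2 = W\left[\tanh\left(\frac{1}{m - k + 1}\sum_{t=k}^{m} H_t\right)\right] + b</tex-math></disp-formula>
        <p>The weight (W) and bias (b) parameters are shared. For the final hidden state vector of the first token ('[CLS]'), we also add an activation operation and a fully connected layer, which is formally expressed as:</p>
        <disp-formula><label>(3)</label><tex-math>H'_0 = W_0\left(\tanh(H_0)\right) + b_0</tex-math></disp-formula>
        <p>Matrices <inline-formula><tex-math>W_0</tex-math></inline-formula>, <inline-formula><tex-math>W_1</tex-math></inline-formula>, <inline-formula><tex-math>W_2</tex-math></inline-formula> have the same dimensions, i.e. <inline-formula><tex-math>W_0 \in \mathbb{R}^{d \times d}</tex-math></inline-formula>, <inline-formula><tex-math>W_1 \in \mathbb{R}^{d \times d}</tex-math></inline-formula>, <inline-formula><tex-math>W_2 \in \mathbb{R}^{d \times d}</tex-math></inline-formula>, where d is the hidden state size from BERT. We concatenate <inline-formula><tex-math>H'_0</tex-math></inline-formula>, <inline-formula><tex-math>H'_1</tex-math></inline-formula> and <inline-formula><tex-math>H'_2</tex-math></inline-formula> and then add a fully connected layer and a softmax layer, which is expressed as:</p>
        <disp-formula><label>(4)</label><tex-math>h'' = W_3\left[\mathrm{concat}(H'_0, H'_1, H'_2)\right] + b_3</tex-math></disp-formula>
        <disp-formula><label>(5)</label><tex-math>y'_t = \mathrm{softmax}(h'')</tex-math></disp-formula>
        <p>Here <inline-formula><tex-math>W_3 \in \mathbb{R}^{N \times 3d}</tex-math></inline-formula>, and <inline-formula><tex-math>y'_t</tex-math></inline-formula> is the softmax probability output over N. In Equations (1), (2), (3), (4) the bias vectors are <inline-formula><tex-math>b_0</tex-math></inline-formula>, <inline-formula><tex-math>b_1</tex-math></inline-formula>, <inline-formula><tex-math>b_2</tex-math></inline-formula>, <inline-formula><tex-math>b_3</tex-math></inline-formula>. We use cross entropy as the loss function. We denote this text-based architecture as BERT-Text-DDI.</p>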
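        <p>The text-based classification head described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name text_head, the toy dimensions, and the randomly initialised weights are hypothetical, and random vectors stand in for the BERT hidden states.</p>
        <preformat><![CDATA[
```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D vector of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def text_head(H, span1, span2, W0, b0, W, b, W3, b3):
    """Hypothetical sketch of the BERT-Text-DDI output head.

    H      : (seq_len, d) final hidden states; row 0 is the [CLS] token.
    span1  : (i, j) inclusive token indices of drug d1.
    span2  : (k, m) inclusive token indices of drug d2.
    W, b   : fully connected layer shared by both entities.
    """
    i, j = span1
    k, m = span2
    h0 = W0 @ np.tanh(H[0]) + b0                     # [CLS] representation
    h1 = W @ np.tanh(H[i:j + 1].mean(axis=0)) + b    # averaged span of d1
    h2 = W @ np.tanh(H[k:m + 1].mean(axis=0)) + b    # averaged span of d2
    logits = W3 @ np.concatenate([h0, h1, h2]) + b3  # fully connected layer
    return softmax(logits)                           # probabilities over N types


# Toy usage: random stand-ins for BERT outputs and for all parameters.
rng = np.random.default_rng(0)
d, n_classes, seq_len = 8, 5, 12
H = rng.normal(size=(seq_len, d))
W0, W = rng.normal(size=(d, d)), rng.normal(size=(d, d))
b0, b = np.zeros(d), np.zeros(d)
W3, b3 = rng.normal(size=(n_classes, 3 * d)), np.zeros(n_classes)
probs = text_head(H, (2, 3), (7, 9), W0, b0, W, b, W3, b3)
```
]]></preformat>
        <p>In this sketch, probs is a vector with one probability per relation type; in training, its cross-entropy against the gold label would be minimised.</p>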
      </sec>
      <sec id="sec-3-2">
        <title>Entity Representation from KG</title>
        <p>To infuse external information about the entities into the relation classification task, we obtain a representation of the two drug entities mentioned in each input instance. We use an in-house heterogeneous biomedical Knowledge Graph (Bio-KG) consisting of target-target, drug-drug, drug-disease, drug-target, disease-disease, and disease-target interactions drawn from a large number of ontologies such as DrugBank (https://go.drugbank.com/), BioSNAP (http://snap.stanford.edu/biodata/), and UniProt (https://www.uniprot.org/) (The UniProt Consortium 2016). The overall statistics of Bio-KG are enumerated in Table 1. The real-world facts observed in Bio-KG are stored as a collection of triples of the form (h, r, t). Each triple is composed of a head entity <inline-formula><tex-math>h \in E</tex-math></inline-formula>, a tail entity <inline-formula><tex-math>t \in E</tex-math></inline-formula>, and a relation <inline-formula><tex-math>r \in R</tex-math></inline-formula> between them, e.g., (paracetamol, treats, fever), which stores the fact that paracetamol is effective in curing fever. Here, E denotes the set of entities and R denotes the set of relations. There are three different types of E in Bio-KG (drugs, diseases, and targets) and five different types of R (target-target, drug-disease, drug-target, disease-disease, and disease-target interactions).</p>
        <table-wrap id="tab-1">
          <label>Table 1</label>
          <caption><p>Statistics of Bio-KG.</p></caption>
          <table>
            <thead>
              <tr><th>Node type</th><th>Count</th></tr>
            </thead>
            <tbody>
              <tr><td>Drug</td><td>6512</td></tr>
              <tr><td>Target</td><td>30098</td></tr>
              <tr><td>Disease</td><td>23458</td></tr>
              <tr><th>Edge type</th><th>Count</th></tr>
              <tr><td>Drug-Target</td><td></td></tr>
              <tr><td>Target-Target</td><td></td></tr>
              <tr><td>Drug-Disease</td><td></td></tr>
              <tr><td>Disease-Disease</td><td></td></tr>
              <tr><td>Disease-Target</td><td></td></tr>
              <tr><td>Total Edges</td><td></td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-3-2-4">
          <title>Knowledge Graph Embeddings</title>
          <p>The aim of a Knowledge Graph embedding (KGE) is to embed the entities and relations into a low-dimensional continuous vector space, so as to simplify computations on the KG. KGE models mostly use the facts in the KG to perform the embedding task, enforcing the embeddings to be compatible with those facts. They provide a generalizable context about the overall Knowledge Graph (KG) that can be used to infer relations. In this work, we employ off-the-shelf KG embeddings to encode the representation of each drug (in terms of its relationships with other entities). The embeddings are computed so that they satisfy certain properties, i.e., they follow a given KGE model. These KGE models define different score functions that measure the distance between two entities relative to their relation type in the low-dimensional embedding space. The score functions are used to train the KGE models so that entities connected by relations are close to each other while entities that are not connected are far apart. The KGEs used in our experiments are explained below:</p>
          <list list-type="bullet">
            <list-item>
              <p>TransE
                <xref ref-type="bibr" rid="ref3">(Bordes et al. 2013)</xref>
                : Given a fact (h, r, t), the relation in TransE is interpreted as a translation vector r so that the embedded entities h and t can be connected by r, i.e., <inline-formula><tex-math>h + r \approx t</tex-math></inline-formula> when (h, r, t) holds. The scoring function is defined as the (negative) distance between h + r and t, i.e.,</p>
              <disp-formula><label>(6)</label><tex-math>f_r(h, t) = -\lVert h + r - t \rVert</tex-math></disp-formula>
            </list-item>
            <list-item>
              <p>TransR (Lin et al. 2015): Given a fact (h, r, t), TransR first projects the entity representations h and t into the space specific to relation r:</p>
              <disp-formula><label>(7)</label><tex-math>h_{\perp} = M_r h, \qquad t_{\perp} = M_r t</tex-math></disp-formula>
              <p>Here <inline-formula><tex-math>M_r</tex-math></inline-formula> is a projection matrix from the entity space to the relation space of r. The score is then computed on the projected vectors in the same translation-based manner as in TransE.</p>
            </list-item>
            <list-item>
              <p>RESCAL (Nickel, Tresp, and Kriegel 2011): Each relation in RESCAL is represented as a matrix which models pairwise interactions between latent factors. The score of a fact (h, r, t) is defined by a bilinear function where h, t are vector representations of the entities, and <inline-formula><tex-math>M_r</tex-math></inline-formula> is a matrix associated with the relation. This score captures pairwise interactions between all components of h and t:</p>
              <disp-formula><label>(8)</label><tex-math>f_r(h, t) = h^{\top} M_r t = \sum_{i=0}^{d-1}\sum_{j=0}^{d-1} [M_r]_{ij}\,[h]_i\,[t]_j</tex-math></disp-formula>
            </list-item>
            <list-item>
              <p>DistMult (Yang et al. 2015): DistMult simplifies RESCAL by restricting <inline-formula><tex-math>M_r</tex-math></inline-formula> to diagonal matrices. For each relation r, it introduces a vector embedding r and requires <inline-formula><tex-math>M_r = \mathrm{diag}(r)</tex-math></inline-formula>. The scoring function is defined as:</p>
              <disp-formula><label>(9)</label><tex-math>f_r(h, t) = h^{\top}\,\mathrm{diag}(r)\,t = \sum_{i=0}^{d-1} [r]_i\,[h]_i\,[t]_i</tex-math></disp-formula>
              <p>This score captures pairwise interactions only between the components of h and t along the same dimension, and reduces the number of parameters to O(d) per relation.</p>
            </list-item>
          </list>
          <p>From Bio-KG, we train these KG embeddings and obtain the representation of all the nodes. In our case, we are only interested in obtaining the representation of drug nodes. We denote the KG representation of drug d as kge.</p>
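          <p>To make the four score functions concrete, the following NumPy sketch evaluates each of them on a toy triple. It is an illustration only: the function names and the example vectors are hypothetical, the norm is taken to be Euclidean, and the TransR score is assumed to be the negative distance between the projected head plus r and the projected tail, which the text above does not spell out.</p>
          <preformat><![CDATA[
```python
import numpy as np


def transe_score(h, r, t):
    """Negative Euclidean distance between h + r and t."""
    return -np.linalg.norm(h + r - t)


def transr_score(h, r, t, M_r):
    """Project h and t with M_r, then score as in TransE (assumed form)."""
    return -np.linalg.norm(M_r @ h + r - M_r @ t)


def rescal_score(h, M_r, t):
    """Bilinear score h^T M_r t over all pairs of components."""
    return h @ M_r @ t


def distmult_score(h, r, t):
    """RESCAL with M_r = diag(r): only same-dimension interactions."""
    return np.sum(h * r * t)


# Toy two-dimensional embeddings (hypothetical values).
h = np.array([1.0, 2.0])
r = np.array([0.5, -1.0])
t = np.array([2.0, 3.0])

s_transe = transe_score(h, r, t)
s_transr = transr_score(h, r, t, np.eye(2))   # identity projection
s_rescal = rescal_score(h, np.diag(r), t)     # diagonal relation matrix
s_distmult = distmult_score(h, r, t)
```
]]></preformat>
          <p>With an identity projection the TransR score reduces to the TransE score, and with a diagonal relation matrix the RESCAL score reduces to the DistMult score, which mirrors the relationships between the models described above.</p>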
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>BERTKG-DDI</title>
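        <p>From the input instance s with two tagged target drug entities <inline-formula><tex-math>d_1</tex-math></inline-formula> and <inline-formula><tex-math>d_2</tex-math></inline-formula>, we obtain the KG embedding representations <inline-formula><tex-math>kge_1</tex-math></inline-formula> and <inline-formula><tex-math>kge_2</tex-math></inline-formula> of the two drugs, respectively, using Bio-KG. We concatenate these two embeddings and pass them through a fully connected layer as represented below:</p>
        <disp-formula><label>(10)</label><tex-math>kge = W\left[\mathrm{concat}(kge_1, kge_2)\right] + b</tex-math></disp-formula>
        <p>W and b are the parameters of the fully connected layer applied to the KG representations <inline-formula><tex-math>kge_1</tex-math></inline-formula> and <inline-formula><tex-math>kge_2</tex-math></inline-formula>. The final layer of the BERTKG-DDI model concatenates all the previous text-based outputs with the drug representation from the KG, as expressed below:</p>
        <disp-formula><label>(11)</label><tex-math>o' = W_3\left[\mathrm{concat}(H'_0, H'_1, H'_2, kge)\right] + b_3</tex-math></disp-formula>
        <disp-formula><label>(12)</label><tex-math>y'_t = \mathrm{softmax}(o')</tex-math></disp-formula>
        <p>Finally, the model is trained by minimizing the cross-entropy loss.</p>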
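        <p>The fusion step can be sketched as follows. This is a hedged illustration rather than the actual implementation: the function name bertkg_ddi_head, the parameter names, and all dimensions are hypothetical, and random vectors stand in for both the text-based outputs and the drug embeddings from Bio-KG.</p>
        <preformat><![CDATA[
```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D vector of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def bertkg_ddi_head(h0, h1, h2, kge1, kge2, W_kg, b_kg, W_out, b_out):
    """Hypothetical sketch of the final BERTKG-DDI layer.

    h0, h1, h2 : text-based outputs for [CLS], drug d1 and drug d2.
    kge1, kge2 : KG embeddings of the two drugs.
    """
    # Fully connected layer over the concatenated drug embeddings.
    kge = W_kg @ np.concatenate([kge1, kge2]) + b_kg
    # Concatenate text-based outputs with the KG representation.
    o = W_out @ np.concatenate([h0, h1, h2, kge]) + b_out
    return softmax(o)


# Toy usage with assumed sizes: d for text vectors, k for KG embeddings,
# d_kg for the output of the KG fully connected layer.
rng = np.random.default_rng(1)
d, k, d_kg, n_classes = 8, 4, 6, 5
h0, h1, h2 = (rng.normal(size=d) for _ in range(3))
kge1, kge2 = rng.normal(size=k), rng.normal(size=k)
W_kg, b_kg = rng.normal(size=(d_kg, 2 * k)), np.zeros(d_kg)
W_out, b_out = rng.normal(size=(n_classes, 3 * d + d_kg)), np.zeros(n_classes)
fused_probs = bertkg_ddi_head(h0, h1, h2, kge1, kge2, W_kg, b_kg, W_out, b_out)
```
]]></preformat>
        <p>The only change relative to the text-only head is the wider input to the last fully connected layer, which is how the KG information reaches the classifier in this sketch.</p>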
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Setup</title>
      <sec id="sec-4-1">
        <title>Dataset and Pre-processing</title>
        <p>We follow the task setting of Task 9.2 in the DDIExtraction 2013 shared task (Herrero-Zazo et al. 2013) for evaluation. The corpus consists of MEDLINE documents annotated with drug mentions and five types of interactions: Mechanism, Effect, Advice, Interaction and Other. The task is a multi-class classification problem in which each drug pair in a sentence is assigned one of these types, and we evaluate using three standard metrics: Precision (P), Recall (R) and F1-score (F1).</p>
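        <p>As an illustration of the metrics, the sketch below computes macro-averaged precision, recall and F1 from gold and predicted labels. It assumes that the macro F1 is the unweighted mean of the per-type F1 scores; the function name macro_prf and the toy labels are hypothetical, and the official shared-task scorer may aggregate differently.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def macro_prf(gold, pred, labels):
    """Macro-averaged precision, recall and F1 over the given labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted type p was wrong
            fn[g] += 1  # gold type g was missed
    precisions, recalls, f1s = [], [], []
    for lab in labels:
        p_den = tp[lab] + fp[lab]
        r_den = tp[lab] + fn[lab]
        p = tp[lab] / p_den if p_den else 0.0
        r = tp[lab] / r_den if r_den else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example with three of the interaction types.
gold = ["Effect", "Effect", "Advice", "Mechanism"]
pred = ["Effect", "Advice", "Advice", "Mechanism"]
macro_p, macro_r, macro_f1 = macro_prf(gold, pred, ["Effect", "Advice", "Mechanism"])
```
]]></preformat>
        <p>Macro averaging gives every relation type the same weight regardless of how many instances it has, so rare types influence the score as much as frequent ones.</p>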
        <p>During pre-processing, we obtain the DRUG mentions in the corpus and map them to unique DrugBank (https://go.drugbank.com/) identifiers. This converts each drug mention into its respective DrugBank ID, a form of entity linking (Mondal et al. 2019), (Leaman, Dogan, and Lu 2013). The mention normalization is performed based on the longest overlap between the drug mentions and the drug names in DrugBank, and the normalized drugs are then mapped to the different Knowledge sources used to construct Bio-KG.</p>
        <table-wrap id="tab-2">
          <label>Table 2</label>
          <table>
            <thead>
              <tr><th>Embeddings on BERT-Text-DDI</th></tr>
            </thead>
            <tbody>
              <tr><td>bert-base-cased</td></tr>
              <tr><td>scibert-scivocab-uncased</td></tr>
              <tr><td>biobert v1.0 pubmed pmc</td></tr>
              <tr><td>biobert v1.1 pubmed</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-4-2">
        <title>Training Details</title>
        <p>
          For our experiments, we initialize the model with various pre-trained contextual embeddings. In particular, we use bert-base-cased (https://huggingface.co/bert-base-cased), scibert-scivocab-uncased
          <xref ref-type="bibr" rid="ref2">(Beltagy, Lo, and Cohan 2019)</xref>
          (https://github.com/allenai/scibert), and the domain-specific biobert v1.0 pubmed pmc and biobert v1.0 pubmed (https://github.com/dmis-lab/biobert) as the initialization of the transformer encoder in BERTKG-DDI. We uniformly keep the maximum sequence length at 300 for all the embedding ablations and train for 5 epochs. For the KG embeddings, we set the embedding dimension to 200. Stochastic Gradient Descent (SGD) is used for optimization with an initial learning rate of 0.0001, and these embeddings are trained for 300 epochs. After training the embeddings, we obtain the final representation of each drug. For the drugs mentioned in the input instance, we use the obtained embeddings as shown in Equation (11). We initialize the drugs that could not be normalized using pre-trained word2vec embeddings (of dimension 200, the same as the KG embeddings) trained on PubMed (http://evexdb.org/pmresources/ngrams/PubMed/).
        </p>
        <table-wrap id="tab-5">
          <label>Table 5</label>
          <table>
            <thead>
              <tr><th>Methods</th></tr>
            </thead>
            <tbody>
              <tr><td>(Zhang et al. 2017)</td></tr>
              <tr><td>(Vivian et al. 2017)</td></tr>
              <tr><td><xref ref-type="bibr" rid="ref1">(Asada, Miwa, and Sasaki 2018)</xref></td></tr>
              <tr><td>(Sun et al. 2019)</td></tr>
              <tr><td>(Zhu et al. 2020)</td></tr>
              <tr><td>Our method (BERTKG-DDI)</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results and Discussion</title>
      <p>In this section, we provide a detailed analysis of the results and findings observed during our experiments. We show empirical results for BERTKG-DDI using both text and KG information.</p>
      <p>Ablation of Embeddings on BERT-Text-DDI: During the ablation analysis, we observe that incorporating domain-specific information through biobert v.1 pubmed boosts the predictive performance in terms of macro-F1 score (across all relation types) by 2.3% compared to bert-base-cased. Moreover, the scibert-scivocab-uncased embeddings, owing to the scientific details obtained during fine-tuning, achieve a reasonable boost in performance. The biobert v.1 pubmed based BERT-Text-DDI is the best-performing text-based relation classification model. The results are enumerated in Table 2.</p>
      <sec id="sec-5-1">
        <title>Ablation analysis of KG Embeddings on BERTKG-DDI</title>
        <p>We compare the different KG embeddings for drugs obtained from Bio-KG, after augmenting the BERT-Text-DDI model with them, in Table 3. The semantic-matching models such as RESCAL and DistMult measure the plausibility of facts by matching the latent semantics of both relations and entities in their vector space. In our experiments, they outperform the translation-based KGEs such as TransE and TransR by an average of 1% macro F1-score.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Advantage of KG information on BERTKG-DDI</title>
        <p>During empirical analysis of the BERTKG-DDI model, we measure how much performance gain can be achieved by augmenting KG embeddings. From the results in terms of macro F1-score on all the relation types in Table 4, we observe that the best-performing BERT-Text-DDI model achieves a performance boost of 1.8% after being augmented with KG information in BERTKG-DDI.</p>
        <p>
          Comparison with the existing baselines: We compare our best-performing model with some of the best-performing existing baselines.
          <xref ref-type="bibr" rid="ref1">(Asada, Miwa, and Sasaki 2018)</xref>
          proposed a neural method to extract drug-drug interactions (DDIs) from texts using external drug molecular structure information. They encode textual drug pairs with convolutional neural networks and their molecular pairs with graph convolutional networks (GCNs), and then concatenate the outputs of these two networks. (Vivian et al. 2017) proposed a model that classifies DDIs from the literature by combining an attention mechanism and a recurrent neural network with long short-term memory (LSTM) units. (Zhang et al. 2017) presented a hierarchical recurrent neural network (RNN)-based method that integrates the shortest dependency path (SDP) and the sentence sequence for the DDI extraction task. (Sun et al. 2019) proposed a recurrent hybrid convolutional neural network (RHCNN) for DDI extraction from biomedical literature. In its embedding layer, the texts mentioning two entities are represented as a sequence of semantic embeddings and position embeddings. In particular, the complete semantic embedding is obtained by fusing a word embedding with its contextual information, which is learnt by a recurrent structure. Recently, (Zhu et al. 2020) proposed multiple entity-aware attentions with various entity information to strengthen the representations of drug entities in sentences. They integrate drug descriptions from Wikipedia and DrugBank into their model to enhance the semantic information of drug entities. They also modified the output of the BioBERT model, and their results show that this is better than using the BioBERT model directly. In comparison, our method achieves state-of-the-art performance on the DDIExtraction 2013 corpus (in terms of the F1-scores of all the relation types), as shown in Table 5.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this paper, we propose an approach, BERTKG-DDI, for DDI relation classification based on pre-trained language models and Knowledge Graph embeddings of the drug entities. Experiments conducted on a benchmark DDI dataset demonstrate the effectiveness of our proposed method. Possible directions for further research include exploring other external drug representations, such as chemical structure and textual description, for predicting DDIs from a textual corpus.</p>
      <p>Leaman, R.; Dogan, R.; and Lu, Z. 2013. DNorm: Disease Name Normalization with Pairwise Learning to Rank. Bioinformatics (Oxford, England) 29. doi:10.1093/bioinformatics/btt474.</p>
      <p>Li, D.; and Ji, H. 2019. Syntax-aware Multi-task Graph Convolutional Networks for Biomedical Relation Extraction. In Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019), 28–33. Hong Kong: Association for Computational Linguistics. doi:10.18653/v1/D19-6204. URL https://www.aclweb.org/anthology/D19-6204.</p>
      <p>Lin, Y.; Liu, Z.; Sun, M.; Liu, Y.; and Zhu, X. 2015. Learning Entity and Relation Embeddings for Knowledge Graph Completion. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, 2181–2187. AAAI Press. ISBN 0262511290.</p>
      <p>Mondal, I. 2020. BERTChem-DDI: Improved Drug-Drug Interaction Prediction from text using Chemical Structure Information. In Proceedings of Knowledgeable NLP: the First Workshop on Integrating Structured Knowledge and Neural Networks for NLP, 27–32. Suzhou, China: Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.knlp-1.4.</p>
      <p>Purkayastha, S.; Mondal, I.; Sarkar, S.; Goyal, P.; and Pillai, J. K. 2019. Drug-Drug Interactions Prediction Based on Drug Embedding and Graph Auto-Encoder. In 2019 IEEE 19th International Conference on Bioinformatics and Bioengineering (BIBE), 547–552.</p>
      <p>Sun, X.; Dong, K.; Ma, L.; Sutcliffe, R.; He, F.; Chen, S.; and Feng, J. 2019. Drug-Drug Interaction Extraction via Recurrent Hybrid Convolutional Neural Networks with an Improved Focal Loss. Entropy 21(1): 37. ISSN 1099-4300. doi:10.3390/e21010037. URL http://dx.doi.org/10.3390/e21010037.</p>
      <p>Wu, S.; and He, Y. 2019. Enriching Pre-trained Language Model with Entity Information for Relation Classification. CoRR abs/1905.08284. URL http://arxiv.org/abs/1905.08284.</p>
      <p>Yang, B.; Yih, W.-t.; He, X.; Gao, J.; and Deng, L. 2015. Embedding Entities and Relations for Learning and Inference in Knowledge Bases. CoRR abs/1412.6575.</p>
      <p>Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; and Le, Q. V. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In NeurIPS.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><surname>Asada</surname>, <given-names>M.</given-names></string-name>;
          <string-name><surname>Miwa</surname>, <given-names>M.</given-names></string-name>; and
          <string-name><surname>Sasaki</surname>, <given-names>Y.</given-names></string-name>
          <year>2018</year>.
          <article-title>Enhancing Drug-Drug Interaction Extraction from Texts by Molecular Structure Information</article-title>.
          In <source>Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>,
          <fpage>680</fpage>-<lpage>685</lpage>.
          <publisher-loc>Melbourne, Australia</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>.
          doi:<pub-id pub-id-type="doi">10.18653/v1/P18-2108</pub-id>.
          URL <uri>https://www.aclweb.org/anthology/P18-2108</uri>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name><surname>Beltagy</surname>, <given-names>I.</given-names></string-name>;
          <string-name><surname>Lo</surname>, <given-names>K.</given-names></string-name>; and
          <string-name><surname>Cohan</surname>, <given-names>A.</given-names></string-name>
          <year>2019</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name><surname>Bordes</surname>, <given-names>A.</given-names></string-name>;
          <string-name><surname>Usunier</surname>, <given-names>N.</given-names></string-name>;
          <string-name><surname>García-Durán</surname>, <given-names>A.</given-names></string-name>;
          <string-name><surname>Weston</surname>, <given-names>J.</given-names></string-name>; and
          <string-name><surname>Yakhnenko</surname>, <given-names>O.</given-names></string-name>
          <year>2013</year>.
          <article-title>Translating Embeddings for Modeling Multi-relational Data</article-title>.
          In <source>NIPS</source>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name><surname>Devlin</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Chang</surname>, <given-names>M.-W.</given-names></string-name>;
          <string-name><surname>Lee</surname>, <given-names>K.</given-names></string-name>; and
          <string-name><surname>Toutanova</surname>, <given-names>K.</given-names></string-name>
          <year>2019</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>