<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Pre-trained Transformer Deep Learning Models to Identify Named Entities and Syntactic Relations for Clinical Protocol Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Miao Chen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fang Du</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ganhui Lan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Victor Lobanov</string-name>
          <email>victor.lobanovg@covance.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Covance, 206 Carnegie Center</institution>
          ,
          <addr-line>Princeton, NJ</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Covance</institution>
          ,
          <addr-line>8211 SciCor Drive, Indianapolis, IN</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>23</fpage>
      <lpage>25</lpage>
      <abstract>
        <p>Transformer deep learning models, such as BERT, have demonstrated their effectiveness over previous baselines on a broad range of general-domain natural language processing (NLP) tasks such as classification, named entity recognition, and question answering (Devlin et al. 2018). They also exhibit enhanced performance in domain-specific NLP tasks, including BioNLP tasks (Lee et al. 2019; Alsentzer et al. 2019). In this study, we focus on clinical trial protocols: exploring and extracting key terms (a named entity recognition task) as well as their relations (a relation extraction task) from the protocols using transformer pre-trained deep learning models. We compare several model configurations and report their results. Our NLP model achieves good performance considering the complex and unique nature of the language in real-world protocols, and has been integrated into the organization's protocol analytics practice. This approach and the extracted information will greatly facilitate trial feasibility analysis for developing new drugs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Clinical trial protocols (often called “study protocols”)
contain key information specifying the trial design and
implementation, but are usually in an unstructured or semi-structured
format, which presents a major challenge for running
computational analysis on them. Because of protocols’ critical role,
drug development businesses, such as contract research
organizations, have been devoting significant amounts of
resources to analyzing study protocols in order to precisely understand
the operational requirements, comprehensively evaluate the
systemic challenges, assess the probability of success without
bias, and accurately forecast the cost implications for optimal
business planning. Currently, this protocol analysis work is
still performed in a labor-intensive fashion, involving
numerous resource checks and cross-referencing steps. To
develop safer, cheaper, and more effective drugs faster for
better public health, there is an urgent need for more
efficient and effective ways to process text-based protocols.</p>
      <p>Here, we present our efforts to facilitate the protocol
analysis workflow by automating the extraction of key
information from protocols using natural language
processing (NLP) techniques. More specifically, we focus on
the eligibility criteria section of the protocols, which
contains patient selection criteria; we extract key
clinically relevant entities (i.e., named entities) and entity
relations (i.e., syntactic relations) from this section. Based on
the extracted information, the unstructured protocols can be
transformed into a structured network of interconnected
key entities (e.g., condition, drug, observation, etc.) that can
be fed into various data-based analytic tasks, for example,
querying various real-world evidence databases for
patient population estimation, which is critical for clinical trial
design in drug development.</p>
      <p>Covance Inc. is the world’s largest provider of clinical
trial design, monitoring, management, and central lab testing
services, and has accumulated a large volume of study
protocols. The presented work is the first step of a larger
mission toward solving the protocol analysis challenge. To this
end, we employ a transfer learning strategy and
experiment with the deep learning family of algorithms, using the
recently developed Bidirectional Encoder Representations
from Transformers (BERT) based models and fine-tuning
them on our in-house clinical trial protocol corpus to
identify the named entities and their relations.</p>
      <p>Study protocols are rigorous scientific documents with
highly domain-specific terms and complex relations. These
characteristics bring both benefits and challenges to NLP
work: we are less concerned about preprocessing, due to the
rigorous use of language, but need to attend more to the unique
yet complex clinical terms and relations. A study protocol’s
eligibility criteria section is usually composed of two parts:
inclusion criteria and exclusion criteria, which respectively
describe the unambiguous characteristics of patients to be
included in and excluded from the clinical trial. The general
public can access some simplified protocol texts via
websites such as ClinicalTrials.gov, which already contain many
clinical terminologies. However, real protocols are much
longer, with even more domain-specific terms, and thus more
difficult for the NLP task. We employ pre-trained BERT
transformers to tackle this challenging NLP task, and our study
provides quantified evidence of how BERT performs in the
clinical trial domain.</p>
      <p>In our practice, the extracted information is stored in a
structured format. Figure 1 shows an example: the inclusion
criteria are represented as several key-value clauses, so that
we can query a patient database to find the patients
satisfying these criteria. Through extraction, we are essentially
connecting dots to build a larger graph for knowledge
engineering purposes, i.e., we connect protocol text to patient database
records, connect protocols to condition terms in a medical
ontology, and so on. Once the dots are properly connected,
we are empowered to perform many protocol analysis tasks,
such as building a search engine for precise search,
composing graph networks for graph analysis to capture missing
links, evaluating drug effectiveness by comparing with
similar drugs, and clustering and recommending similar protocols
for study feasibility analysis.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Named entity recognition (NER) and relation extraction (RE) are two classical natural language processing (NLP) tasks, which we carry out to extract entities and syntactic relations, respectively, in our study. Previously, for NER, researchers mainly investigated probabilistic sequence labeling models such as conditional random fields (CRF), maximum entropy Markov models, and hidden Markov models
        <xref ref-type="bibr" rid="ref14 ref23">(Lafferty, McCallum, and Pereira 2001; McCallum, Freitag, and Pereira 2000; Bikel et al. 1998)</xref>
        . For RE, text classification methods, such as support vector machines, logistic regression, and the perceptron, along with feature engineering, have been used to assign relations between entities
        <xref ref-type="bibr" rid="ref13 ref2">(Bach and Badaskar 2007; Jurafsky 2000)</xref>
        .
      </p>
      <p>
        In recent years, with advances in deep neural network methods, significant performance improvements have been achieved on the NER and RE tasks. For NER tasks, embeddings are widely used in neural network models to represent words or characters as high-dimensional vectors. Recurrent neural networks (RNNs), including LSTM, GRU, and their variants, are applied because their architectures better represent the sentence context as well as the dynamic sentence length of natural languages
        <xref ref-type="bibr" rid="ref11 ref22 ref32">(Huang, Xu, and Yu 2015; Yang, Salakhutdinov, and Cohen 2016)</xref>
        . The Bidirectional LSTM (Bi-LSTM) plus CRF network architecture has also been widely used to achieve better NER performance
        <xref ref-type="bibr" rid="ref15 ref22">(Ma and Hovy 2016; Lample et al. 2016)</xref>
        .
      </p>
      <p>
        Despite the improvements from these models, RNN and LSTM models tend to “forget” earlier context in long sequences, which limits model performance. Transformers were subsequently proposed to counter this issue. Transformer models use the attention mechanism, which attends to each word in a sequence, replacing the sequence-based RNN-style network structure with dot products and multiplications between the key/value/query matrices projected from the embedding vectors
        <xref ref-type="bibr" rid="ref30">(Vaswani et al. 2017)</xref>
        . Transformers have the advantage of attending to every token in a sequence, whether long or short, and can therefore capture associations even between tokens that are distantly separated from each other. BERT (Bidirectional Encoder Representations from Transformers), a recently popular NLP deep learning model, employs multiple layers of attention and significantly improved NLP task performance over previous models
        <xref ref-type="bibr" rid="ref8">(Devlin et al. 2018)</xref>
        .
      </p>
      <p>
        Additionally, transfer learning aims to transfer a pre-trained model from one task to another, usually by training a general language model on a general-domain data set and transferring it to a downstream task by fine-tuning on the task-specific data set. A number of pre-trained language models have been created to facilitate downstream tasks such as NER and RE; examples include ELMo, ULMFiT, OpenAI GPT, and BERT, which have outperformed previous baselines, and some have even achieved state-of-the-art performance
        <xref ref-type="bibr" rid="ref10 ref24 ref25">(Peters et al. 2018; Howard and Ruder 2018; Radford et al. 2019)</xref>
        .
      </p>
      <p>
        Based on the original BERT architecture, a number of BERT variants have emerged with alterations for different purposes. For example, RoBERTa removes next sentence prediction from the original loss function, along with some other hyperparameter changes; Transformer-XL captures context both within and between segments to tackle long-term dependencies across sentences; and T5 advocates an encoder-decoder architecture, denoising objectives, and other changes based on extensive experiments
        <xref ref-type="bibr" rid="ref19 ref26 ref6">(Liu et al. 2019; Dai et al. 2019; Raffel et al. 2019)</xref>
        .
      </p>
      <p>
        NER and RE have also been longstanding tasks in the biomedical NLP domain. Researchers have investigated applying similar yet more customized approaches to biomedical texts, such as CRF models and BiLSTM+CRF neural networks
        <xref ref-type="bibr" rid="ref16 ref21 ref31">(Leaman and Gonzalez 2008; Lyu et al. 2017; Wei et al. 2016)</xref>
        . With the introduction of the BERT model, BERT-based models have been adapted to the biomedical domain by retraining on biomedical corpora; examples include BioBERT, SciBERT, and Clinical BERT
        <xref ref-type="bibr" rid="ref1 ref18 ref3">(Lee et al. 2019; Beltagy, Cohan, and Lo 2019; Alsentzer et al. 2019)</xref>
        .
      </p>
      <p>
        In the clinical informatics field, it is important to convert unstructured criteria text into a structured format, because this enables people to automatically parse criteria and query for eligible patients against a real-world evidence database. NER and RE algorithms are therefore an appropriate and natural fit for this practice: NER extracts concepts, such as conditions and observations, that are related to a patient; RE provides operational information, such as the range of a particular lab test result, for patient selection. Criteria2Query is a pioneering work in the space of translating study criteria to SQL queries
        <xref ref-type="bibr" rid="ref33">(Yuan et al. 2019)</xref>
        . It relies mainly on CRF sequence labeling for the NER task and SVM classification for relation extraction. To the best of our knowledge, there has been no prior research or practice using pre-trained transformer deep learning methods to extract structured information from unstructured clinical trial protocols. Motivated by the excellent performance of BERT-based models on NER and RE tasks in general domains, we experiment with and develop models and evaluate their performance in the clinical trial domain.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <sec id="sec-3-1">
        <title>Data Set</title>
        <p>To facilitate our NLP approach, we selected 470 study
protocols from Covance’s in-house protocol database; our
protocol corpus comprises the eligibility criteria sections from
these selected study protocols. An eligibility criteria section
typically contains 5-20 sentences that define the criteria used to
select and recruit patients for the clinical study. Our data
contain a total of 30,183 criteria sentences.</p>
        <p>
          Data Annotation. We had the eligibility criteria
annotated using the IOB format
          <xref ref-type="bibr" rid="ref27">(Ramshaw and Marcus 1999)</xref>
          .
The corpus was annotated by well-trained biomedical domain
experts to serve as the gold standard for training and testing. They
manually annotated the key clinical entities and their pairwise
relations, if any exist. We focus on 15 types of entities
and 7 types of relations that help clinically define a patient
cohort:
        </p>
        <p>Entities: Condition, Observation, Procedure, Device,
Drug, Investigational product, Event, Refractory condition,
Demographics, Measurement, Temporal constraints,
Qualifier/modifier, Anatomic location, Negation cue, Permission
cue</p>
        <p>Syntactic relations: Has value, Has temporal constraint,
Modified by, Located in, Is negated, Is permitted, Specified
by</p>
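<p>To make the annotation scheme concrete, the following sketch shows how IOB tags encode two of the entity types above (the sentence, the tags, and the helper iob_to_spans are invented for illustration and are not drawn from our corpus):</p>

```python
# Hypothetical example of IOB-tagged eligibility-criteria tokens.
# B- marks the beginning of an entity, I- its continuation, O = outside.
tokens = ["Patients", "with", "type", "2", "diabetes", "for",
          "at", "least", "6", "months"]
tags = ["O", "O", "B-Condition", "I-Condition", "I-Condition", "O",
        "B-Temporal", "I-Temporal", "I-Temporal", "I-Temporal"]

def iob_to_spans(tokens, tags):
    """Collect (entity_type, phrase) spans from an IOB tag sequence."""
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current:
            current[1].append(tok)
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

print(iob_to_spans(tokens, tags))
# [('Condition', 'type 2 diabetes'), ('Temporal', 'at least 6 months')]
```

<p>Under this scheme, recovering entity phrases from the per-token tags is a simple linear scan, which is how predicted tag sequences are turned back into entity spans for evaluation.</p>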
        <p>Data Split. For the NER task, we randomly split the
30,183 sentences into training (60%, 18,109 sentences) and
test (40%, 12,074 sentences) sets. For the RE task, before
splitting the data for training and testing, we first check
whether a sentence contains multiple relations; if so,
we duplicate the sentence for each pair of related entities
and use their relation type as the label for classification.
This results in 52,470 relation sample sentences, from
which we perform a random split with stratification on
relation classes to derive training (60%, 31,482 relation
samples) and test (40%, 20,988 relation samples) sets. Tables 1
and 2 show data statistics for the NER and RE tasks.</p>
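<p>The duplication and stratified-split procedure described above can be sketched as follows (the toy sentences, field names, and the use of scikit-learn are illustrative assumptions, not our production pipeline):</p>

```python
# Sketch of the RE data preparation: each annotated relation in a
# sentence becomes its own classification sample.
from sklearn.model_selection import train_test_split

annotated = [
    {"text": "Hemoglobin above 9 g/dL within 2 weeks",
     "relations": [("Hemoglobin", "above 9 g/dL", "Has value"),
                   ("Hemoglobin", "within 2 weeks", "Has temporal constraint")]},
    {"text": "No history of stroke",
     "relations": [("history of stroke", "No", "Is negated")]},
]

samples, labels = [], []
for sent in annotated:
    # Duplicate the sentence once per annotated entity pair.
    for head, tail, rel in sent["relations"]:
        samples.append({"text": sent["text"], "head": head, "tail": tail})
        labels.append(rel)

# Replicate the toy data so every class has enough members to stratify.
samples, labels = samples * 5, labels * 5

# 60/40 split, stratified on the relation class (as in the paper).
train_x, test_x, train_y, test_y = train_test_split(
    samples, labels, test_size=0.4, stratify=labels, random_state=0)
print(len(train_x), len(test_x))  # 9 6
```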
      </sec>
      <sec id="sec-3-2">
        <title>NER Task</title>
        <p>
          As previously mentioned, we use NER algorithms to extract clinically relevant entities in the eligibility criteria section, and in particular choose BERT, a pre-trained transformer type of deep learning model, because of its reported superior performance on many NLP tasks. Thanks to the attention transformers in BERT, it is able to provide dynamic context embeddings for tokens, which helps address the polysemy issue. BERT is a language model pre-trained on a large general-domain corpus and can be applied to downstream tasks by adding simply structured task layers and fine-tuning on a task-specific data set. We hereby follow the fine-tuning practice based on pre-trained models to derive our NER model
          <xref ref-type="bibr" rid="ref18 ref8">(Devlin et al. 2018; Lee et al. 2019)</xref>
          . We explore several options with regard to the choice of pre-trained models and task layers.
        </p>
        <p>
          NER task layers. The original BERT paper indicates that, when used for NER tasks, the pre-trained BERT model can simply be followed by a softmax layer in which each token is classified into its most likely entity class, without adding any CRF layer
          <xref ref-type="bibr" rid="ref8">(Devlin et al. 2018)</xref>
          . However, our experiments suggest that this approach sometimes fails to recognize contiguous phrases as whole entities. To address this issue, we further experiment with BiLSTM+CRF layers as the NER task layer, for their potentially better ability to capture bidirectional context as well as the tagging likelihood at the sentence level (as opposed to the token level).
        </p>
        <p>Cased or uncased. The BERT models provided by Google
include versions with and without lowercasing
preprocessing of the tokens. We experiment with both the cased (no
lowercasing) and uncased (lowercasing applied)
options. Consequently, the two options use different
subword vocabularies: the cased model has 28,996 subwords
and the uncased model has 30,522 subwords.</p>
        <p>Pre-trained models. We use BERT-base, a smaller
version of BERT comprising 110 million parameters,
in our first set of experiments. BERT also has a larger
version, BERT-large, with 340 million parameters. We opt to
use BERT-base for exploration purposes. In our second set
of experiments, we test the BioBERT model, which is retrained
on large-scale biomedical texts on the basis of the
original BERT model. BioBERT has only a cased version and
shares the same vocabulary as BERT-base cased (with a size
of 28,996).</p>
        <p>Hyperparameters. For both the BERT-base and BioBERT
models, we set number of epochs = 20, learning rate = 2e-5,
training batch size = 32, and max sequence length = 32. When
using BiLSTM+CRF as the task layer, we set the BiLSTM
layer size to 128.</p>
        <p>
          The above model options result in 6 NER models:
BERT-base-uncased + Softmax: BERT-base uncased pre-trained model, softmax as the NER task layer;
BERT-base-cased + Softmax: BERT-base cased pre-trained model, softmax as the NER task layer;
BioBERT + Softmax: BioBERT pre-trained model (cased), softmax as the NER task layer;
BERT-base-uncased + BiLSTM+CRF: BERT-base uncased pre-trained model, BiLSTM+CRF as the NER task layer;
BERT-base-cased + BiLSTM+CRF: BERT-base cased pre-trained model, BiLSTM+CRF as the NER task layer;
BioBERT + BiLSTM+CRF: BioBERT pre-trained model (cased), BiLSTM+CRF as the NER task layer.
The layout of the BERT NER neural architecture is shown in Figure 2.
        </p>
        <p>
          RE Task. The RE task is also treated as a downstream task of the pre-trained models. The original BERT paper did not include an RE task among its downstream tasks, whereas the BioBERT study investigated it due to its importance in the biomedical NLP domain
          <xref ref-type="bibr" rid="ref18">(Lee et al. 2019)</xref>
          . BioBERT handles relation extraction as a classification task at the sentence or sequence level. In particular, it assumes that each sentence contains at most one relation and classifies whether a whole sentence, instead of a particular pair of entities, contains a relation of interest, e.g., a gene-disease relation. This approach is not directly applicable to our data for 2 reasons: 1) our data contain multiple types of relations, and 2) in our data set, one sentence often contains multiple relations (52,470 relations / 30,183 sentences = 1.7 relations/sentence on average).
        </p>
        <p>We employ the following strategy for the RE task. In
training, we first scan each sentence for entities
using the human annotations and record the token positions of
each entity; if a sentence contains n (n &gt; 1) pairs of
entities with human-annotated relations, we duplicate the
sentence n times so that each instance represents one
pair of entities and their relation. In prediction, we use the NER
pipeline results to locate entities, enumerate all legitimate
entity pairs, and duplicate sentences accordingly. Since we
record the token positions of each entity pair, we can obtain the
BERT output vectors for them based on their position
information, concatenate the two vectors, and then feed the result to a
softmax layer to classify their relation. The result can be one
of the 7 relations listed in Table 2 or ‘no relation’.</p>
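<p>The pair-vector classification step described above can be sketched as follows (a simplified PyTorch sketch with illustrative shapes; in the real model the BERT encoder is fine-tuned end to end):</p>

```python
# Sketch of the RE head: take the BERT output vectors at the two
# entities' token positions, concatenate them, and classify into the
# 7 relation types plus 'no relation'.
import torch
import torch.nn as nn

HIDDEN, NUM_CLASSES = 768, 8     # 7 relations + 'no relation'

class RelationHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.classifier = nn.Linear(2 * HIDDEN, NUM_CLASSES)

    def forward(self, token_embeddings, head_pos, tail_pos):
        # token_embeddings: (batch, seq, HIDDEN); positions: (batch,)
        idx = torch.arange(token_embeddings.size(0))
        head_vec = token_embeddings[idx, head_pos]   # (batch, HIDDEN)
        tail_vec = token_embeddings[idx, tail_pos]   # (batch, HIDDEN)
        pair = torch.cat([head_vec, tail_vec], dim=-1)
        return self.classifier(pair)                 # (batch, NUM_CLASSES)

x = torch.randn(4, 32, HIDDEN)        # fake BERT output for 4 sentences
head = torch.tensor([1, 0, 5, 2])     # first-entity token positions
tail = torch.tensor([4, 3, 9, 7])     # second-entity token positions
print(RelationHead()(x, head, tail).shape)   # torch.Size([4, 8])
```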
        <p>More specifically, the input fed to the BERT RE model
is the sentence text along with the positions of the entity pair.
We do not make use of entity type information, for the following
reasons: 1) this end-to-end (i.e., tokens-to-relation) practice
makes the RE model more useful as a standalone tool that
does not require entity types; 2) in prediction mode,
errors in entity prediction could propagate to the RE task,
which we mitigate by including only the entity position
information. Figure 3 shows the neural architecture of our RE
task.</p>
        <p>For training purposes, we randomly generate negative
samples for the ‘no relation’ class, as two entities can have
no relation with each other. We use two ways to obtain
negative samples: one is to randomly choose two unrelated
entities in a sentence; the other is to break an existing related
entity pair and establish a non-related pair between one of
the entities in the original pair and another unrelated entity
in the sentence.</p>
        <p>Similar to the NER task, we experiment with 3 pre-trained
models, with softmax as the task layer for all of them:
BERT-base-uncased: BERT-base pre-trained model, uncased;
BERT-base-cased: BERT-base pre-trained model, cased;
BioBERT: BioBERT pre-trained model (cased).
The following hyperparameter configuration is used:
number of epochs = 20, learning rate = 2e-5,
training batch size = 32, max sequence length = 32.</p>
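<p>The two negative-sampling strategies described above can be sketched as follows (entity strings and helper names are invented for illustration):</p>

```python
# Sketch of the two negative-sampling strategies for 'no relation'.
import random
random.seed(0)  # deterministic toy run

entities = ["hemoglobin", "9 g/dL", "2 weeks", "stroke"]
positive_pairs = [("hemoglobin", "9 g/dL"), ("hemoglobin", "2 weeks")]

def negative_by_random_pair(entities, positive_pairs):
    """Pick two entities in the sentence that are not annotated as related."""
    while True:
        a, b = random.sample(entities, 2)
        if (a, b) not in positive_pairs and (b, a) not in positive_pairs:
            return (a, b)

def negative_by_breaking_pair(entities, positive_pairs):
    """Keep one entity of a related pair, re-pair it with an unrelated one."""
    a, _ = random.choice(positive_pairs)
    candidates = [e for e in entities
                  if e != a and (a, e) not in positive_pairs
                  and (e, a) not in positive_pairs]
    return (a, random.choice(candidates))
```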
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results and Analysis</title>
      <p>We implement the NER and RE tasks using Tensorflow
based on the BERT neural architecture and run experiments
on an AWS p2.xlarge GPU instance.</p>
      <sec id="sec-4-1">
        <title>NER Results</title>
        <p>We follow the practice of the SemEval-2013 Drug-Drug
Interactions task and evaluate NER performance by 3
matching standards: strict, exact, and partial (Segura-Bedmar,
Martínez, and Herrero-Zazo 2013). The strict matching
evaluates both the boundary and the entity type of entity phrases; the
exact matching evaluates the exact boundary regardless of
entity type; and the partial matching measures partial
boundary overlap regardless of entity type (thus it is the most
lenient). We calculate precision (P), recall (R), and F1-score (F) for
the three evaluation types; additionally, we also report
macro-averaged P/R/F results. The results are shown in Table
3.</p>
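<p>As a rough illustration of the three matching standards (a simplified sketch; our actual evaluation follows the SemEval-2013 DDI evaluation), consider matching a predicted span against a gold span:</p>

```python
# Simplified illustration of the strict / exact / partial standards.
def match(gold, pred, mode):
    """gold/pred are (start, end, type) token spans; end is exclusive."""
    g_start, g_end, g_type = gold
    p_start, p_end, p_type = pred
    if mode == "strict":    # boundary and entity type must both match
        return (g_start, g_end, g_type) == (p_start, p_end, p_type)
    if mode == "exact":     # exact boundary, entity type ignored
        return (g_start, g_end) == (p_start, p_end)
    if mode == "partial":   # any boundary overlap, entity type ignored
        return min(g_end, p_end) > max(g_start, p_start)
    raise ValueError(mode)

gold = (0, 5, "Temporal")   # e.g. a full five-token temporal phrase
pred = (0, 1, "Temporal")   # model captured only the first token
print(match(gold, pred, "strict"))    # False: boundary differs
print(match(gold, pred, "exact"))     # False
print(match(gold, pred, "partial"))   # True: the spans overlap
```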
        <p>In our experiments, fine-tuning the pre-trained BioBERT
model achieves slightly better performance than its BERT
counterparts. For example, BioBERT + Softmax has an
F1-score of 70.61, better than BERT-base-uncased + Softmax’s
69.80 and BERT-base-cased + Softmax’s 69.68.
Similarly, BioBERT + BiLSTM+CRF holds a higher
F1-score than BERT-base-uncased + BiLSTM+CRF and
BERT-base-cased + BiLSTM+CRF for all four
evaluation types.</p>
        <p>When comparing the cased and uncased strategies, we
notice that the uncased pre-trained models outperform
the cased ones with the same neural architecture: e.g.,
BERT-base-uncased + BiLSTM+CRF achieves an F1-score
of 70.28 for the strict evaluation type, higher than the
F1-score of 69.89 from BERT-base-cased + BiLSTM+CRF.
This finding suggests that applying lowercasing
preprocessing actually enhances performance slightly, which is
counter-intuitive for NER tasks, as entities are often
case-sensitive. Meanwhile, we also find that the two BioBERT
models, which are cased, perform better than their peer
models with the same neural architecture. But since BioBERT only
offers a cased option, we cannot discern the relative
contribution of casing in the BioBERT pre-trained model.</p>
        <p>From Table 3, it is not surprising that for a given model,
the partial evaluation usually holds the highest score,
followed by exact, strict, and macro. Another observation is
that when we loosen the evaluation type from strict to exact, i.e.,
focusing on the entity boundary without penalizing entity type
errors, the performance improves but still remains in the
73.15-74.06 range, suggesting that the experimented BERT-based
models fail to identify entity boundaries very precisely,
which can be of interest for future investigation.</p>
        <p>In our experiments with simple softmax as the task layer,
we observe more boundary detection errors. This in fact
motivated us to add the BiLSTM+CRF
layers as the NER task layer. However, the results show that,
given the same pre-trained model configuration, it is
debatable whether BiLSTM+CRF consistently improves
performance. For example, BioBERT + BiLSTM+CRF
slightly outperforms BioBERT + Softmax in strict
matching precision and F1-score, but BioBERT + Softmax beats
BioBERT + BiLSTM+CRF in strict matching recall.</p>
        <p>We also find that the recall score is consistently higher
than the precision score for all models at all evaluation
standards, indicating that the models tend to make more false
positive predictions than false negative predictions. The macro
scores show lower performance than strict/exact/partial
because the macro average simply averages the performance of
different entity types, and some small-sample entity types have
lower performance due to a lack of training data.</p>
        <p>Overall, BioBERT + BiLSTM+CRF produces the
best precision and F1-scores for all four evaluation types,
whereas BioBERT + Softmax holds the highest recall.
These results suggest that fine-tuning BioBERT lends itself
better to NER tasks in the clinical trial domain, which
seems intuitive. For the task layer, however, the choice between
softmax and BiLSTM+CRF does not significantly affect
performance.</p>
      </sec>
      <sec id="sec-4-2">
        <title>RE Results</title>
        <p>RE evaluation results are shown in Table 4, in which we
report micro/macro/weighted precision (P), recall (R), and
F1-score (F).</p>
        <p>From the performance chart, we find
that BERT-base-uncased has the highest F1-scores,
whereas BERT-base-cased has the lowest. Comparing
BERT-base-cased and BioBERT indicates that BioBERT
helps performance slightly, at least in this cased
scenario. On the other hand, BERT-base-uncased noticeably
improves over its cased peer, BERT-base-cased, by a 4.33
percentage-point margin. Therefore, just like the NER task,
the RE task also benefits from lowercasing, probably because
the uncased setting reduces vocabulary variation in processing. We
also observe that recall and precision are close to each other,
with precision slightly higher for the macro evaluation but
recall slightly higher than precision for micro and weighted.
These observations suggest that the
model has a higher precision score than recall score in classes
with fewer samples, such as ‘Located in’ and ‘Is negated’
(in Table 2); when doing the macro evaluation, the
contribution from these smaller classes becomes more visible.</p>
        <p>Overall, the BERT-base-uncased model prevails: it
outperforms the other two models on every evaluation type and
measure. For example, it has an F1-score of 78.79 for
micro, compared to BERT-base-cased’s 74.46 and BioBERT’s
74.60. These results indicate again that lowercasing
preprocessing helps the NLP tasks even in the clinical trial
domain, where many terms are represented in capital
letters. Secondly, BioBERT beating BERT-base-cased by a
small margin may suggest that although pre-training in the
biomedical domain can bring some benefit, it is still not
specific enough for clinical trials. Since no uncased
BioBERT pre-trained model is available, it is unclear whether
training on a biomedical corpus with lowercasing
preprocessing could synergistically improve the performance.
Considering the big improvement from BERT-base-cased to
BERT-base-uncased, we believe an uncased version of the
current BioBERT model is worth future investigation.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Error Analysis</title>
        <p>
          We present and inspect NER prediction results from one of the models (BERT-base-uncased + Softmax) in a Brat server, an open-source tool that helps visualize annotation results using color bars
          <xref ref-type="bibr" rid="ref29">(Stenetorp et al. 2012)</xref>
          . We overlay the human and prediction annotations together in Brat to facilitate the comparison.
        </p>
        <p>The NER errors can be broadly categorized into
boundary errors and entity type errors, as reflected by the four
evaluation types. For boundary errors, one pattern is that
BERT tends to mis-annotate some words inside a multi-word
phrase. For example, as shown in Figure 4, “at least a 3
month” is one temporal constraint entity, but the NER model
only captures “at” + “3 month” while missing the words in the
middle (“least a”). This reflects a potential problem with
BERT NER models: although they can assign entity classes
relatively well, the lack of structure enforcement on the output
layer can cause inconsistent labels within a full
phrase.</p>
        <p>In some cases, the NER model captures longer
entities than the human annotator. For example, the model
annotates “[cardiac mechanical assist device]|Device”,
whereas the gold standard annotates the same phrase as
“[cardiac]|AnatomicLocation” + “[mechanical assist
device]|Device”. In some other cases, the situation reverses
and the NER model chunks one entity in the gold
standard into multiple ones. For example, “[non-steroidal
anti-inflammatory drugs]|Drug” is chunked into a
Qualifier/Modifier and a Drug: “[non-steroidal]|Qualifier/Modifier
[anti-inflammatory drugs]|Drug”. The boundary merging
and chunking issues, as illustrated by these two examples,
occur frequently with the Qualifier/Modifier class, as it is
arguable whether a complex term should be annotated as one whole
entity or as a Qualifier/Modifier plus an entity.</p>
        <p>For entity type errors, we observe a few cases, such
as “urinalysis” (a Procedure-type entity) being predicted as an
Observation entity, and “gastrointestinal motility” (a
Condition-type entity) being predicted as a Drug. Type errors
occur less frequently than boundary errors according to our
manual inspection.</p>
        <p>For the RE task, we manually screen the predictions from
the BERT-base, uncased, Softmax model against the gold
standard. We first observe that NER boundary errors
can propagate to the RE task. Note that we use only named
entity positions, not types, in the RE task, and
therefore only NER boundary errors can affect RE
performance. For example, “Transient neurologic deficits”,
annotated as one Condition entity in the gold standard, is
split into “Transient” (Qualifier/Modifier) and “neurologic
deficits” (Condition), causing the RE task to predict a
‘modified by’ relation between the two entities that
does not exist in the gold standard. The other major
category of RE classification error is that a number of actual
relations are misclassified as ‘no relation’, whereas
misclassification between other classes is much less frequent.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>In this study, we focus on extracting clinically relevant terms
and relations from protocol eligibility criteria by applying
pre-trained transformer deep learning NLP models for NER
and RE tasks. We experiment with several configurations of
the pre-trained BERT models and report our results and
findings.</p>
      <p>Our results demonstrate the effectiveness of NLP
models in processing clinical trial protocols. Despite the fact
that the processed texts are unique, with specialized clinical and
medical terms and logical relations, the BERT and BioBERT
models achieved acceptable performance. We also find that,
in general, BioBERT, which is pre-trained on a biomedical
corpus, outperforms BERT, which is pre-trained on a general-domain
corpus. This agrees with the general understanding
of the importance of domain-specific training for achieving
higher model performance on domain-specific tasks.</p>
      <p>A surprising finding is that even though the clinical trial
domain contains many capitalized terms,
lowercasing during preprocessing improves the performance of both the
NER and RE tasks. Our hypothesis is that reducing
token variation (lowercasing yields fewer token variants) is more
important than preserving case for these tasks.</p>
      <p>It is also worth noting that there is room to improve
the quality of our gold standard. Because the protocols
cover many different sub-domains in
biomedical and clinical sciences, such as various therapeutic areas,
even human experts can easily make mistakes or be
inconsistent. In fact, we found many cases in which the model predictions
are in fact correct, although different from the gold standard.
To address this annotation quality issue, we employed an
iterative annotation pipeline that asks human experts to verify
documents pre-annotated by the NLP models. We
anticipate that this practice can help partly address the issue.</p>
      <p>We believe that the model performance can be
further improved, and we see several promising
directions. The first is to train a
biomedical BERT model from scratch using a domain-specific
vocabulary. The BERT model handles tokens by splitting
them into subwords using a predefined subword
vocabulary. For example, ’myocarditis’ and ’pericarditis’, two
heart conditions sharing the same suffix ’carditis’, are
represented as ’my’+’##oca’+’##rdi’+’##tis’ and
’per’+’##ica’+’##rdi’+’##tis’ respectively. This
tokenization does not represent the suffix in a biomedically
meaningful way because the vocabulary lacks biomedical subwords.
We expect that subwords derived from the
biomedical domain, reflecting word-root patterns, can
further enhance the word representations of BERT models and
thus improve downstream task performance. To this end, we can train a
BERT model from scratch using a biomedical corpus and a
biomedical subword vocabulary.</p>
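      <p>The subword mechanics can be sketched with a greedy longest-match-first tokenizer, the matching scheme WordPiece applies at inference time. The vocabulary below is hypothetical; with a biomedical entry such as ’##carditis’, the shared suffix survives tokenization intact:</p>

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, as in BERT's WordPiece.
    Non-initial pieces carry the '##' continuation prefix."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # no matching piece: whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical biomedical vocabulary that keeps the suffix as one piece.
bio_vocab = {"myo", "peri", "##carditis"}
print(wordpiece("myocarditis", bio_vocab))   # ['myo', '##carditis']
print(wordpiece("pericarditis", bio_vocab))  # ['peri', '##carditis']
```

      <p>With the general-domain vocabulary the same suffix fragments into pieces like ’##rdi’+’##tis’ that carry no biomedical meaning.</p>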
      <p>The second strategy is to deploy multi-task co-training:
the NER and RE tasks depend on each other,
in that knowing one task’s output can facilitate the
other’s, so joint learning is expected to
improve performance on both.</p>
      <p>Our third strategy for future improvement is to reduce the
unnecessary relations currently predicted by the RE model.
Our current greedy prediction pipeline enumerates all
possible entity pairs, which results in an unnecessarily large
set of test candidates. One way to address this issue is to use
dependency parsing information, which can
indicate whether two terms have a dependency relation, to prune
unnecessary entity pairs.</p>
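      <p>One way such pruning could work, sketched on a toy dependency tree (the sentence, head indices, and threshold are illustrative; in practice a parser would supply the tree): keep only entity pairs whose dependency-path length falls under a threshold.</p>

```python
from itertools import combinations

def dep_path_len(heads, i, j):
    """Path length between tokens i and j in a dependency tree given
    as head indices (the root's head is -1)."""
    def chain(k):
        out = [k]
        while heads[k] != -1:
            k = heads[k]
            out.append(k)
        return out
    ci, cj = chain(i), chain(j)
    common = set(ci) & set(cj)
    # distance through the nearest common ancestor
    return min(ci.index(c) + cj.index(c) for c in common)

# Toy sentence: "aspirin inhibits platelet aggregation strongly"
heads = [1, -1, 3, 1, 1]      # e.g. aspirin -> inhibits (the root)
entities = [0, 2, 3]          # token indices of candidate entities
MAX_LEN = 2                   # illustrative pruning threshold
pairs = [(a, b) for a, b in combinations(entities, 2)
         if dep_path_len(heads, a, b) <= MAX_LEN]
print(pairs)  # -> [(0, 3), (2, 3)]; the distant pair (0, 2) is pruned
```

      <p>Only the surviving pairs would then be passed to the RE classifier, shrinking the candidate set before prediction.</p>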
      <p>The information extracted by the NER and RE tasks
has great potential to assist the drug development
business, especially study feasibility analysis. The derived
information forms the basis of a local knowledge graph for the
protocols, and of a global graph when merged with external
structured information such as drug ontologies. In
conclusion, this is our first step in a broader mission to apply
deep learning to business cases in drug development, and
subsequent analysis based on the derived graph can
further extend our contributions and insights in this research
area.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><surname>Alsentzer</surname>, <given-names>E.</given-names></string-name>;
          <string-name><surname>Murphy</surname>, <given-names>J. R.</given-names></string-name>;
          <string-name><surname>Boag</surname>, <given-names>W.</given-names></string-name>;
          <string-name><surname>Weng</surname>, <given-names>W.-H.</given-names></string-name>;
          <string-name><surname>Jin</surname>, <given-names>D.</given-names></string-name>;
          <string-name><surname>Naumann</surname>, <given-names>T.</given-names></string-name>; and
          <string-name><surname>McDermott</surname>, <given-names>M.</given-names></string-name>
          <year>2019</year>.
          <article-title>Publicly available clinical BERT embeddings</article-title>.
          arXiv preprint arXiv:1904.03323.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Bach</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Badaskar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>A review of relation extraction</article-title>
          .
          <source>Literature review for Language and Statistics II 2.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cohan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lo</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Scibert: Pretrained contextualized embeddings for scientific text</article-title>
          . arXiv preprint arXiv:1903.10676.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          1998.
          <article-title>Nymble: a high-performance learning name-finder</article-title>
          . arXiv preprint cmp-lg/9803003.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name><surname>Dai</surname>, <given-names>Z.</given-names></string-name>;
          <string-name><surname>Yang</surname>, <given-names>Z.</given-names></string-name>;
          <string-name><surname>Yang</surname>, <given-names>Y.</given-names></string-name>;
          <string-name><surname>Cohen</surname>, <given-names>W. W.</given-names></string-name>;
          <string-name><surname>Carbonell</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Le</surname>, <given-names>Q. V.</given-names></string-name>; and
          <string-name><surname>Salakhutdinov</surname>, <given-names>R.</given-names></string-name>
          <year>2019</year>.
          <article-title>Transformer-XL: Attentive language models beyond a fixed-length context</article-title>.
          arXiv preprint arXiv:1901.02860.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name><surname>Devlin</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Chang</surname>, <given-names>M.-W.</given-names></string-name>;
          <string-name><surname>Lee</surname>, <given-names>K.</given-names></string-name>; and
          <string-name><surname>Toutanova</surname>, <given-names>K.</given-names></string-name>
          <year>2018</year>.
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>.
          arXiv preprint arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Howard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Universal language model fine-tuning for text classification</article-title>
          . arXiv preprint arXiv:1801.06146.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name><surname>Huang</surname>, <given-names>Z.</given-names></string-name>;
          <string-name><surname>Xu</surname>, <given-names>W.</given-names></string-name>; and
          <string-name><surname>Yu</surname>, <given-names>K.</given-names></string-name>
          <year>2015</year>.
          <article-title>Bidirectional LSTM-CRF models for sequence tagging</article-title>.
          arXiv preprint arXiv:1508.01991.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2000</year>
          .
          <article-title>Speech &amp; language processing</article-title>
          .
          <source>Pearson Education India.</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F. C.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          .
          <source>ICML proceedings.</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name><surname>Lample</surname>, <given-names>G.</given-names></string-name>;
          <string-name><surname>Ballesteros</surname>, <given-names>M.</given-names></string-name>;
          <string-name><surname>Subramanian</surname>, <given-names>S.</given-names></string-name>;
          <string-name><surname>Kawakami</surname>, <given-names>K.</given-names></string-name>; and
          <string-name><surname>Dyer</surname>, <given-names>C.</given-names></string-name>
          <year>2016</year>.
          <article-title>Neural architectures for named entity recognition</article-title>.
          arXiv preprint arXiv:1603.01360.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name><surname>Leaman</surname>, <given-names>R.</given-names></string-name>, and
          <string-name><surname>Gonzalez</surname>, <given-names>G.</given-names></string-name>
          <year>2008</year>.
          <article-title>BANNER: an executable survey of advances in biomedical named entity recognition</article-title>.
          In <source>Biocomputing 2008</source>. World Scientific.
          <fpage>652</fpage>-<lpage>663</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name><surname>Lee</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Yoon</surname>, <given-names>W.</given-names></string-name>;
          <string-name><surname>Kim</surname>, <given-names>S.</given-names></string-name>;
          <string-name><surname>Kim</surname>, <given-names>D.</given-names></string-name>;
          <string-name><surname>Kim</surname>, <given-names>S.</given-names></string-name>;
          <string-name><surname>So</surname>, <given-names>C. H.</given-names></string-name>; and
          <string-name><surname>Kang</surname>, <given-names>J.</given-names></string-name>
          <year>2019</year>.
          <article-title>BioBERT: a pre-trained biomedical language representation model for biomedical text mining</article-title>.
          <source>Bioinformatics.</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name><surname>Liu</surname>, <given-names>Y.</given-names></string-name>;
          <string-name><surname>Ott</surname>, <given-names>M.</given-names></string-name>;
          <string-name><surname>Goyal</surname>, <given-names>N.</given-names></string-name>;
          <string-name><surname>Du</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Joshi</surname>, <given-names>M.</given-names></string-name>;
          <string-name><surname>Chen</surname>, <given-names>D.</given-names></string-name>;
          <string-name><surname>Levy</surname>, <given-names>O.</given-names></string-name>;
          <string-name><surname>Lewis</surname>, <given-names>M.</given-names></string-name>;
          <string-name><surname>Zettlemoyer</surname>, <given-names>L.</given-names></string-name>; and
          <string-name><surname>Stoyanov</surname>, <given-names>V.</given-names></string-name>
          <year>2019</year>.
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>.
          arXiv preprint arXiv:1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Lyu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Long short-term memory rnn for biomedical named entity recognition</article-title>
          .
          <source>BMC bioinformatics 18</source>
          (1):
          <fpage>462</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>End-to-end sequence labeling via bi-directional lstm-cnns-crf</article-title>
          .
          arXiv preprint arXiv:1603.01354.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Freitag</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F. C.</given-names>
          </string-name>
          <year>2000</year>
          .
          <article-title>Maximum entropy markov models for information extraction and segmentation</article-title>
          . In ICML, volume
          <volume>17</volume>
          ,
          <fpage>591</fpage>
          -
          <lpage>598</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M. E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Deep contextualized word representations</article-title>
          .
          arXiv preprint arXiv:1802.05365.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name><surname>Radford</surname>, <given-names>A.</given-names></string-name>;
          <string-name><surname>Wu</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Child</surname>, <given-names>R.</given-names></string-name>;
          <string-name><surname>Luan</surname>, <given-names>D.</given-names></string-name>;
          <string-name><surname>Amodei</surname>, <given-names>D.</given-names></string-name>; and
          <string-name><surname>Sutskever</surname>, <given-names>I.</given-names></string-name>
          <year>2019</year>.
          <article-title>Language models are unsupervised multitask learners</article-title>.
          <source>OpenAI Blog</source> <volume>1</volume>(<issue>8</issue>).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name><surname>Raffel</surname>, <given-names>C.</given-names></string-name>;
          <string-name><surname>Shazeer</surname>, <given-names>N.</given-names></string-name>;
          <string-name><surname>Roberts</surname>, <given-names>A.</given-names></string-name>;
          <string-name><surname>Lee</surname>, <given-names>K.</given-names></string-name>;
          <string-name><surname>Narang</surname>, <given-names>S.</given-names></string-name>;
          <string-name><surname>Matena</surname>, <given-names>M.</given-names></string-name>;
          <string-name><surname>Zhou</surname>, <given-names>Y.</given-names></string-name>;
          <string-name><surname>Li</surname>, <given-names>W.</given-names></string-name>; and
          <string-name><surname>Liu</surname>, <given-names>P. J.</given-names></string-name>
          <year>2019</year>.
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>.
          arXiv e-prints.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Ramshaw</surname>
            ,
            <given-names>L. A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Marcus</surname>
            ,
            <given-names>M. P.</given-names>
          </string-name>
          <year>1999</year>
          .
          <article-title>Text chunking using transformation-based learning</article-title>
          .
          <source>In Natural language processing using very large corpora</source>
          . Springer.
          <fpage>157</fpage>
          -
          <lpage>176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          2013.
          <article-title>SemEval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013)</article-title>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name><surname>Stenetorp</surname>, <given-names>P.</given-names></string-name>;
          <string-name><surname>Pyysalo</surname>, <given-names>S.</given-names></string-name>;
          <string-name><surname>Topić</surname>, <given-names>G.</given-names></string-name>;
          <string-name><surname>Ohta</surname>, <given-names>T.</given-names></string-name>;
          <string-name><surname>Ananiadou</surname>, <given-names>S.</given-names></string-name>; and
          <string-name><surname>Tsujii</surname>, <given-names>J.</given-names></string-name>
          <year>2012</year>.
          <article-title>brat: a web-based tool for NLP-assisted text annotation</article-title>.
          <source>In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics</source>,
          <fpage>102</fpage>-<lpage>107</lpage>. Avignon, France: Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Uszkoreit</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          ;
          <string-name><surname>Kaiser</surname>, <given-names>Ł.</given-names></string-name>; and
          <string-name><surname>Polosukhin</surname>, <given-names>I.</given-names></string-name>
          <year>2017</year>.
          <article-title>Attention is all you need</article-title>.
          <source>In Advances in neural information processing systems</source>,
          <fpage>5998</fpage>-<lpage>6008</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name><surname>Wei</surname>, <given-names>Q.</given-names></string-name>;
          <string-name><surname>Chen</surname>, <given-names>T.</given-names></string-name>;
          <string-name><surname>Xu</surname>, <given-names>R.</given-names></string-name>;
          <string-name><surname>He</surname>, <given-names>Y.</given-names></string-name>; and
          <string-name><surname>Gui</surname>, <given-names>L.</given-names></string-name>
          <year>2016</year>.
          <article-title>Disease named entity recognition by combining conditional random fields and bidirectional recurrent neural networks</article-title>.
          <source>Database</source> 2016.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name><surname>Yang</surname>, <given-names>Z.</given-names></string-name>;
          <string-name><surname>Salakhutdinov</surname>, <given-names>R.</given-names></string-name>; and
          <string-name><surname>Cohen</surname>, <given-names>W.</given-names></string-name>
          <year>2016</year>.
          <article-title>Multitask cross-lingual sequence tagging from scratch</article-title>.
          arXiv preprint arXiv:1603.06270.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name><surname>Yuan</surname>, <given-names>C.</given-names></string-name>;
          <string-name><surname>Ryan</surname>, <given-names>P. B.</given-names></string-name>;
          <string-name><surname>Ta</surname>, <given-names>C.</given-names></string-name>;
          <string-name><surname>Guo</surname>, <given-names>Y.</given-names></string-name>;
          <string-name><surname>Li</surname>, <given-names>Z.</given-names></string-name>;
          <string-name><surname>Hardin</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Makadia</surname>, <given-names>R.</given-names></string-name>;
          <string-name><surname>Jin</surname>, <given-names>P.</given-names></string-name>;
          <string-name><surname>Shang</surname>, <given-names>N.</given-names></string-name>;
          <string-name><surname>Kang</surname>, <given-names>T.</given-names></string-name>; et al.
          <year>2019</year>.
          <article-title>Criteria2Query: a natural language interface to clinical databases for cohort definition</article-title>.
          <source>Journal of the American Medical Informatics Association</source>
          <volume>26</volume>(<issue>4</issue>):<fpage>294</fpage>-<lpage>305</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>