=Paper=
{{Paper
|id=Vol-2600/paper4
|storemode=property
|title=Using Pre-trained Transformer Deep Learning Models to Identify Named Entities and Syntactic Relations for Clinical Protocol Analysis
|pdfUrl=https://ceur-ws.org/Vol-2600/paper4.pdf
|volume=Vol-2600
|authors=Miao Chen,Fang Du,Ganhui Lan,Victor Lobanov
|dblpUrl=https://dblp.org/rec/conf/aaaiss/ChenDLL20
}}
==Using Pre-trained Transformer Deep Learning Models to Identify Named Entities and Syntactic Relations for Clinical Protocol Analysis==
Miao Chen,¹ Fang Du,² Ganhui Lan,² Victor Lobanov²
¹ Covance, 8211 SciCor Drive, Indianapolis, IN, USA
² Covance, 206 Carnegie Center, Princeton, NJ, USA
{miao.chen, fang.du, ganhui.lan, victor.lobanov}@covance.com

Copyright © 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Abstract

Transformer deep learning models, such as BERT, have demonstrated their effectiveness over previous baselines on a broad range of general-domain natural language processing (NLP) tasks such as classification, named entity recognition, and question answering (Devlin et al. 2018). They also exhibit enhanced performance on domain-specific NLP tasks, including BioNLP tasks (Lee et al. 2019; Alsentzer et al. 2019). In this study, we focus on clinical trial protocols: exploring and extracting key terms (a named entity recognition task) as well as their relations (a relation extraction task) from the protocols using pre-trained transformer deep learning models. We compare several model configurations and report their results. Our NLP model achieves good performance considering the complex and unique nature of the language in real-world protocols, and has been integrated into the organization's protocol analytics practice. This approach and the extracted information will greatly facilitate trial feasibility analysis for developing new drugs.

Introduction

Clinical trial protocols (often called "study protocols") contain key information specifying trial design and implementation, but they are usually in an unstructured or semi-structured format, which presents a huge challenge for running computational analysis on them. Because of the protocols' critical role, drug development businesses, such as contract research organizations, have been devoting significant resources to analyzing study protocols in order to precisely understand the operational requirements, comprehensively evaluate the systemic challenges, unbiasedly assess the probability of success, and accurately forecast the cost implications for optimal business planning. Currently, this protocol analysis work is still performed in a labor-intensive fashion, involving numerous resource-checking and cross-referencing steps. To develop safer, cheaper, and more effective drugs faster for better public health, there is an urgent need for more efficient and effective ways to process text-based protocols.

Here, we present our efforts to facilitate the protocol analysis workflow by automating the extraction of key information from the protocols using natural language processing (NLP) techniques. More specifically, we focus on the eligibility criteria section of the protocols, which contains patient selection criteria; we extract key clinically relevant entities (i.e. named entities) and entity relations (i.e. syntactic relations) from this section. Based on the extracted information, the unstructured protocols can be transformed into a structured network of interconnected key entities (e.g. condition, drug, observation, etc.) that can be fed into various data-based analytic tasks, for example querying real-world evidence databases for patient population estimation, which is critical for clinical trial design in drug development.

Covance Inc. is the world's largest provider of clinical trial design, monitoring, management, and central lab testing services, and has accumulated a large volume of study protocols. The presented work is our first step in a bigger mission towards solving the protocol analysis challenge. To this end, we employ a transfer learning strategy and experiment with the deep learning family of algorithms, using the recently developed Bidirectional Encoder Representations from Transformers (BERT) based models and fine-tuning them on our in-house clinical trial protocol corpus to identify named entities and their relations.

Study protocols are rigorous scientific documents with highly domain-specific terms and complex relations. These characteristics bring both benefits and challenges to NLP work: we worry less about preprocessing because of the rigorous use of language, but need to attend more to the unique yet complex clinical terms and relations. A study protocol's eligibility criteria section is usually composed of two parts, inclusion criteria and exclusion criteria, which respectively describe the unambiguous characteristics of patients to be included in and excluded from the clinical trial. The general public can access some simplified protocol texts via websites such as ClinicalTrials.gov, which already contain many clinical terminologies. However, real protocols are much longer, with even more domain-specific terms, and are thus more difficult for the NLP task. We employ pre-trained BERT transformers to tackle this challenging NLP task, and our study provides quantified evidence of how BERT performs in the clinical trial domain.
Figure 1: Structured information extracted from protocol eligibility criteria.

In our practice, the extracted information is stored in a structured format. Figure 1 shows an example: the inclusion criteria are represented as several key-value clauses so that we can query a patient database to find the patients satisfying these criteria. Through extraction we are essentially connecting dots to build a larger graph for knowledge engineering purposes, i.e. we connect protocol text to patient database records, connect protocols to condition terms in a medical ontology, and so on. Once the dots are properly connected, we are empowered to perform many protocol analysis tasks, such as building a search engine for precise search, composing graph networks for graph analysis to capture missing links, evaluating drug effectiveness by comparing with similar drugs, and clustering and recommending similar protocols for study feasibility analysis.

Related Work

Named entity recognition (NER) and relation extraction (RE) are two classical natural language processing (NLP) tasks, which we carry out to extract entities and syntactic relations, respectively, in our study. Previously, for NER, researchers mainly investigated probabilistic sequence labeling models such as conditional random fields (CRF), maximum entropy Markov models, and hidden Markov models (Lafferty, McCallum, and Pereira 2001; McCallum, Freitag, and Pereira 2000; Bikel et al. 1998). For RE, text classification methods, such as support vector machines, logistic regression, and perceptrons, along with feature engineering, have been used to assign relations between entities (Bach and Badaskar 2007; Jurafsky 2000).

In recent years, with the advances in deep neural network methods, significant performance improvements have been achieved on the NER and RE tasks. For NER, embeddings are widely used in neural network models to represent words or characters as high-dimensional vectors. Recurrent neural networks (RNN), including LSTM, GRU, and their variants, are applied because their architectures better represent sentence context as well as the dynamic sentence length of natural language (Huang, Xu, and Yu 2015; Yang, Salakhutdinov, and Cohen 2016). The bidirectional LSTM (Bi-LSTM) plus CRF network architecture has also been widely used to achieve better NER performance (Ma and Hovy 2016; Lample et al. 2016).

Despite the improvement over previous models, RNN and LSTM models tend to "forget" earlier context in long sequences, which limits model performance. Transformers were subsequently proposed to counter this issue. Transformer models use an attention mechanism that attends to each word in a sequence, replacing the sequence-based RNN-style network structure with dot products and multiplications between the key/value/query matrices projected from the embedding vectors (Vaswani et al. 2017). Transformers have the advantage of attending to every token in a sequence, whether long or short, and can therefore capture associations between tokens even when they are distantly separated from each other. BERT (Bidirectional Encoder Representations from Transformers), a recent popular NLP deep learning model, employs multiple layers of attention and significantly improved NLP task performance over previous models (Devlin et al. 2018).

Additionally, transfer learning aims to transfer a pre-trained model from one task to another, usually by training a general language model on a general-domain data set and transferring it to a downstream task by fine-tuning on the task-specific data set. A number of pre-trained language models have been created to facilitate downstream tasks such as NER and RE; examples include ELMo, ULMFiT, OpenAI GPT, and BERT, which have outperformed previous baselines, with some achieving state-of-the-art performance (Peters et al. 2018; Howard and Ruder 2018; Radford et al. 2019).

Based on the original BERT architecture, a number of BERT variants have emerged with alterations for different purposes. For example, RoBERTa removes next sentence prediction from the original loss function along with some other hyperparameter changes; Transformer-XL captures context both within and between segments to tackle long-term dependency across sentences; and T5 advocates an encoder-decoder architecture, denoising objectives, and other changes based on extensive experiments (Liu et al. 2019; Dai et al. 2019; Raffel et al. 2019).

NER and RE have also been longstanding tasks in the biomedical NLP domain. Researchers have investigated applying similar yet more customized approaches to biomedical texts, such as CRF models and BiLSTM+CRF neural networks (Leaman and Gonzalez 2008; Lyu et al. 2017; Wei et al. 2016). With the introduction of the BERT model, BERT-based models have been adapted to the biomedical domain by retraining them on biomedical corpora; examples include BioBERT, SciBERT, and clinical BERT (Lee et al. 2019; Beltagy, Cohan, and Lo 2019; Alsentzer et al. 2019).

In the clinical informatics field, it is important to convert unstructured criteria text into a structured format, because this enables a criterion to be parsed automatically and queried for eligible patients against a real-world evidence database. Therefore, NER and RE algorithms are an appropriate and natural fit for this practice: NER extracts concepts, such as conditions and observations, that relate to a patient; RE provides operational information, such as the range for a particular lab test result, for patient selection. Criteria2Query is a pioneering work in the space of translating study criteria to SQL queries (Yuan et al. 2019). It relies mainly on CRF sequence labeling for the NER task and SVM classification for relation extraction. To the best of our knowledge, there has been no research or practice using pre-trained transformer deep learning methods to extract structured information from unstructured clinical trial protocols. Motivated by the excellent performance of BERT-based models on NER and RE tasks in general domains, we experiment with, develop, and evaluate such models in the clinical trial domain.
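To ground the preceding discussion, the sketch below shows one way an extracted criterion could be stored as key-value clauses (in the spirit of Figure 1) and used to filter a patient table. The field names, the example criterion, and the matching logic are hypothetical illustrations, not our production schema.

```python
# Hypothetical structured rendering of one inclusion criterion.
criterion = {
    "text": "Type 2 diabetes diagnosed at least 6 months prior to screening",
    "entities": [
        {"type": "Condition", "text": "Type 2 diabetes"},
        {"type": "Temporal constraints",
         "text": "at least 6 months prior to screening"},
    ],
    "relations": [("Type 2 diabetes", "has temporal constraint",
                   "at least 6 months prior to screening")],
}

def matches(patient, criterion):
    """Toy patient filter: checks that every Condition entity appears in the
    patient's recorded conditions. A real system would map entities to a
    medical ontology and evaluate the temporal clauses properly."""
    wanted = {e["text"].lower() for e in criterion["entities"]
              if e["type"] == "Condition"}
    return wanted <= {c.lower() for c in patient["conditions"]}
```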
Methodology

Data Set

To facilitate our NLP approach, we selected 470 study protocols from Covance's in-house protocol database. Our protocol corpus comprises the eligibility criteria sections from these selected study protocols. An eligibility criteria section typically contains 5-20 sentences that define the criteria used to select and recruit patients for the clinical study. Our data contain a total of 30,183 criteria sentences.

Data Annotation. We have the eligibility criteria annotated using the IOB format (Ramshaw and Marcus 1999). The corpus is annotated by well-trained biomedical domain experts as the gold standard for training and testing. They manually annotate the key clinical entities and their pairwise relations, if any exist. We focus on 15 types of entities and 7 types of relations that help clinically define a patient cohort:

Entities: Condition, Observation, Procedure, Device, Drug, Investigational product, Event, Refractory condition, Demographics, Measurement, Temporal constraints, Qualifier/modifier, Anatomic location, Negation cue, Permission cue

Syntactic relations: Has value, Has temporal constraint, Modified by, Located in, Is negated, Is permitted, Specified by

Data Split. For the NER task, we randomly split the 30,183 sentences into training (60%, 18,109 sentences) and test (40%, 12,074 sentences) sets. For the RE task, before splitting the data for training and testing, we first check whether a sentence contains multiple relations; if so, we duplicate the sentence for each pair of related entities and make their relation type the label for classification. This results in 52,470 relation sample sentences, on which we perform a random split with stratification on relation classes to derive training (60%, 31,482 relation samples) and test (40%, 20,988 relation samples) sets. Tables 1 and 2 show data statistics for the NER and RE tasks.

Table 1: Train and test data counts for the NER task.

Entity                   Train   Test
Condition               12,682  8,537
Observation              7,309  5,218
Procedure                3,406  2,234
Device                     221    140
Drug                     7,793  5,858
Investigational product    329    224
Event                    2,430  1,625
Refractory condition       381    278
Demographics               498    381
Measurement              4,540  3,344
Temporal constraints     6,968  4,589
Qualifier/modifier       7,853  5,196
Anatomic location          427    223
Negation cue               921    615
Permission cue           1,236    869

Table 2: Train and test data counts for the RE task.

Relation                 Train   Test
is negated                 703    468
is permitted             1,009    673
modified by              5,715  3,810
has value                3,326  2,218
has temporal constraint  6,169  4,112
is located                 215    143
specified by             3,729  2,486
no relation             10,616  7,078
total count             31,482 20,988

NER Task

As previously mentioned, we use NER algorithms to extract clinically relevant entities from the eligibility criteria section, and we particularly choose BERT, a pre-trained transformer type of deep learning model, because of its reported superior performance on many NLP tasks. Thanks to the attention transformer in BERT, it can provide dynamic contextual embeddings for tokens, which helps address the polysemy issue. BERT is a language model pre-trained on a large general-domain corpus and can be applied to downstream tasks by adding simply structured task layers and fine-tuning on a task-specific data set. We follow this fine-tuning practice based on pre-trained models to derive our NER model (Devlin et al. 2018; Lee et al. 2019). We explore several options with regard to the choice of pre-trained models and task layers.

NER task layers. The original BERT paper indicates that, when used for NER tasks, the pre-trained BERT model can simply be followed by a softmax layer, where each token is classified into its most likely entity class without adding any CRF layer (Devlin et al. 2018). However, our experiments suggest that this approach sometimes fails to recognize contiguous phrases as whole entities. To address this issue, we further experiment with BiLSTM+CRF layers as the NER task layer, for their potentially better ability to capture bi-directional context as well as tagging likelihood at the sentence level (as opposed to the token level).

Cased or uncased. The BERT model provided by Google includes versions with and without lowercasing preprocessing on the tokens. We experiment with both the cased (no lowercasing) and uncased (lowercasing applied) options. Consequently, the two options use different subword vocabularies: the cased model has 28,996 subwords and the uncased model has 30,522 subwords.
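The difference between the two vocabularies is easy to inspect. The sketch below uses the Hugging Face transformers library, which is one way to load the two public checkpoints; it is an illustration, not our exact tooling.

```python
from transformers import BertTokenizer  # assumes the transformers package

# The cased and uncased BERT checkpoints ship different subword vocabularies
# (28,996 vs. 30,522 entries), so the same clinical phrase can tokenize
# differently under the two options.
cased = BertTokenizer.from_pretrained("bert-base-cased")
uncased = BertTokenizer.from_pretrained("bert-base-uncased")

print(cased.vocab_size, uncased.vocab_size)  # 28996 30522

for text in ["Myocardial Infarction", "myocarditis"]:
    # Rare clinical terms are split into generic subwords in both
    # vocabularies; the pieces rarely align with biomedical word roots.
    print(text, "->", cased.tokenize(text), "vs", uncased.tokenize(text))
```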
Pre-trained models. In our first set of experiments we use BERT-base, a smaller version of BERT comprising 110 million parameters. BERT also has a larger version, BERT-large, with 340 million parameters; we opt for BERT-base for exploration purposes. In our second set of experiments, we test the BioBERT model, which is retrained on large-scale biomedical texts on top of the original BERT model. BioBERT has only a cased version and shares the same vocabulary as BERT-base cased (with a size of 28,996).

Hyperparameters. For both the BERT-base and BioBERT models, we set num_of_epochs=20, learning_rate=2e-5, training_batch_size=32, and max_sequence_length=32. When using BiLSTM+CRF as the task layers, we set bilstm_layer_size=128.

The above model options result in 6 NER models:

• BERT-base uncased, Softmax: BERT-base uncased pre-trained model, softmax as the NER task layer
• BERT-base cased, Softmax: BERT-base cased pre-trained model, softmax as the NER task layer
• BioBERT, Softmax: BioBERT pre-trained model (cased), softmax as the NER task layer
• BERT-base uncased, BiLSTM+CRF: BERT-base uncased pre-trained model, BiLSTM+CRF as the NER task layer
• BERT-base cased, BiLSTM+CRF: BERT-base cased pre-trained model, BiLSTM+CRF as the NER task layer
• BioBERT, BiLSTM+CRF: BioBERT pre-trained model (cased), BiLSTM+CRF as the NER task layer

The layout of the BERT NER neural architecture is shown in Figure 2.

Figure 2: Neural architecture of the BERT NER task (with Softmax as the task layer).

RE Task

The RE task is also treated as a downstream task of the pre-trained models. The original BERT paper did not include an RE task among its downstream tasks, whereas the BioBERT study investigated it because of its importance in the biomedical NLP domain (Lee et al. 2019). BioBERT handles relation extraction as a classification task at the sentence or sequence level. In particular, it assumes that each sentence contains at most one relation and classifies whether a whole sentence, rather than a particular pair of entities, contains a relation of interest, e.g. a gene-disease relation. This approach is not directly applicable to our data for two reasons: 1) our data contain multiple types of relations, and 2) in our data set, one sentence often contains multiple relations (52,470 relations / 30,183 sentences = 1.7 relations per sentence on average).

We employ the following strategy for the RE task. In training, we first scan each sentence for entities using the human annotations and record the token positions of each entity; if a sentence contains n (n > 1) pairs of entities with human-annotated relations, we duplicate the sentence n times so that each instance represents one pair of entities and their relation. In prediction, we use the NER pipeline results to locate entities, enumerate all legitimate entity pairs, and duplicate sentences accordingly. Since we record the token positions of each entity pair, we can retrieve the BERT output vectors for the two entities based on their position information, concatenate the two vectors, and feed the result to a softmax layer to classify their relation. The result can be one of the 7 relations listed in Table 2 or 'no relation'.

More specifically, the input fed to the BERT RE model is the sentence text along with the positions of the entity pair. We do not make use of entity type information, for the following reasons: 1) this end-to-end (i.e. tokens-to-relation) practice makes the RE model more useful as a standalone tool that does not require entity types; 2) in prediction mode, errors in entity prediction could propagate to the RE task, which we mitigate by including only the entity position information. Figure 3 shows the neural architecture of our RE task.

Figure 3: Neural architecture of the BERT RE task (with Softmax as the task layer).
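The pair-level head described above can be sketched as follows. This is a minimal illustration of how the pair representation is assembled from the recorded token positions; the mean pooling over each entity span is our assumption, as the pooling function is not specified above.

```python
import numpy as np

def relation_logits(token_vectors, e1_span, e2_span, W, b):
    """Sketch of the pair-level RE head: pool the BERT output vectors of the
    two candidate entities, concatenate them, and score the relation classes.

    token_vectors: [seq_len, hidden] BERT outputs for one sentence copy.
    e1_span, e2_span: (start, end) token positions of the two entities.
    W, b: softmax-layer weights mapping the concatenated pair vector to
          8 classes (the 7 relations in Table 2 plus 'no relation').
    """
    # Pool each entity span into a single vector (mean pooling assumed here).
    v1 = token_vectors[e1_span[0]:e1_span[1]].mean(axis=0)
    v2 = token_vectors[e2_span[0]:e2_span[1]].mean(axis=0)
    pair = np.concatenate([v1, v2])  # [2 * hidden]
    return pair @ W + b              # unnormalized class scores
```

In the full model, this head is trained jointly with BERT fine-tuning; the sketch only shows how a single sentence copy with one marked entity pair is classified.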
For training purposes, we randomly generate negative samples for the 'no relation' class, since two entities can have no relation with each other. We use two ways to obtain negative samples: one is to randomly choose two unrelated entities in a sentence; the other is to break an existing related entity pair and form a non-related pair between one of the entities in the original pair and another unrelated entity in the sentence.

Similar to the NER task, we experiment with 3 pre-trained models, with softmax as the task layer for all of them:

• BERT-base uncased: BERT-base pre-trained model, uncased
• BERT-base cased: BERT-base pre-trained model, cased
• BioBERT: BioBERT pre-trained model (cased)

The following hyperparameter configuration is used: num_of_epochs=20, learning_rate=2e-5, training_batch_size=32, max_sequence_length=32.
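As a concrete sketch of the two negative-sampling schemes described above, the function below draws 'no relation' pairs for one sentence. The 50/50 mix between the schemes and the per-sentence count k are our assumptions, not specified settings.

```python
import random

def sample_no_relation_pairs(entities, related_pairs, k):
    """Draw up to k 'no relation' entity pairs for one sentence.

    entities: list of entity ids occurring in the sentence (assumed sortable).
    related_pairs: set of (e1, e2) pairs carrying an annotated relation.
    """
    if len(entities) < 2:
        return []
    negatives = set()
    attempts = 0
    while len(negatives) < k and attempts < 100 * k:  # guard against stalls
        attempts += 1
        if related_pairs and random.random() < 0.5:
            # Scheme 2: break an annotated pair, keeping one of its entities
            # and pairing it with another entity in the sentence.
            e1 = random.choice(sorted(related_pairs))[0]
            e2 = random.choice([e for e in entities if e != e1])
        else:
            # Scheme 1: draw two entities that are not annotated as related.
            e1, e2 = random.sample(list(entities), 2)
        if (e1, e2) not in related_pairs and (e2, e1) not in related_pairs:
            negatives.add((e1, e2))
    return list(negatives)
```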
Results and Analysis

We implement the NER and RE tasks in TensorFlow based on the BERT neural architecture and run the experiments on an AWS p2.xlarge GPU instance.

NER Results

We follow the practice of the SemEval-2013 Drug-Drug Interactions task and evaluate NER performance with 3 matching standards: strict, exact, and partial (Segura-Bedmar, Martínez, and Herrero-Zazo 2013). Strict matching evaluates both the boundary and the entity type of entity phrases; exact matching evaluates the exact boundary regardless of entity type; and partial matching measures the partial boundary of entities regardless of entity type (and is thus the most lenient). We calculate precision (P), recall (R), and F1-score (F) for the three evaluation types; additionally, we report macro-average P/R/F results. The results are shown in Table 3.

Table 3: NER task results: Precision (P), Recall (R), F1 Score (F).

NER Model                      Type      P      R      F
BERT-base uncased, Softmax     strict   67.76  71.98  69.80
                               exact    71.02  75.44  73.16
                               partial  75.28  79.96  77.55
                               macro    62.65  66.83  64.63
BERT-base cased, Softmax       strict   67.82  71.66  69.68
                               exact    71.19  75.22  73.15
                               partial  75.41  79.68  77.49
                               macro    63.04  66.37  64.63
BioBERT, Softmax               strict   68.73  72.60  70.61
                               exact    71.87  75.91  73.83
                               partial  75.99  80.26  78.06
                               macro    62.97  67.27  65.03
BERT-base uncased, BiLSTM+CRF  strict   68.59  72.06  70.28
                               exact    71.85  75.49  73.62
                               partial  76.10  79.95  77.98
                               macro    63.43  66.45  64.88
BERT-base cased, BiLSTM+CRF    strict   68.09  71.80  69.89
                               exact    71.34  75.22  73.23
                               partial  75.55  79.67  77.56
                               macro    62.68  66.41  64.45
BioBERT, BiLSTM+CRF            strict   69.12  72.47  70.76
                               exact    72.35  75.85  74.06
                               partial  76.55  80.25  78.36
                               macro    63.79  67.44  65.54

In our experiments, fine-tuning the pre-trained BioBERT model achieves slightly better performance than its BERT counterparts. For example, BioBERT, Softmax has a strict F1-score of 70.61, better than BERT-base uncased, Softmax's 69.80 and BERT-base cased, Softmax's 69.68. Similarly, BioBERT, BiLSTM+CRF holds a higher F1-score than BERT-base uncased, BiLSTM+CRF and BERT-base cased, BiLSTM+CRF for all four evaluation types.

When comparing the cased and uncased strategies, we notice that the uncased pre-trained models outperform the cased ones with the same neural architecture: e.g. BERT-base uncased, BiLSTM+CRF achieves an F1-score of 70.28 for the strict evaluation type, higher than the 69.89 of BERT-base cased, BiLSTM+CRF. This finding suggests that lowercasing preprocessing actually enhances performance slightly, which is counter-intuitive for NER tasks, as entities are often case-sensitive. Meanwhile, we also find that the two BioBERT models, which are cased, perform better than their peer models with the same neural architecture. But since BioBERT offers only a cased option, we cannot discern the relative contribution of casing in the BioBERT pre-trained model.

From Table 3, it is not surprising that, for a given model, the partial evaluation usually holds the highest score, followed by exact, strict, and macro. Another observation is that when we loosen the evaluation type from strict to exact, i.e. focusing on entity boundaries without penalizing entity type errors, performance improves but still remains in the 73.15-74.06 range, suggesting that the experimented BERT-based models fail to identify entity boundaries very precisely, which can be of interest for future investigation.

In our experiments with simple Softmax as the task layer, we observe more boundary detection errors. This was in fact the motivation for adding the BiLSTM+CRF layers as the NER task layer. However, the results show that, given the same pre-trained model configuration, it is debatable whether BiLSTM+CRF consistently improves performance. For example, BioBERT, BiLSTM+CRF slightly outperforms BioBERT, Softmax in strict matching precision and F1-score, but BioBERT, Softmax beats BioBERT, BiLSTM+CRF in strict matching recall.
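For readers who want to reproduce the scoring, the sketch below shows our reading of the three matching standards defined at the start of this subsection; the official DDIExtraction 2013 evaluation defines further details (e.g. partial-credit weighting), so this is illustrative rather than authoritative.

```python
def match_type(pred, gold_spans):
    """Classify one predicted entity against the gold annotations.

    pred: (start, end, etype) token offsets and type of a predicted entity.
    gold_spans: list of (start, end, etype) gold entities in the sentence.
    """
    for gs, ge, gt in gold_spans:
        if (pred[0], pred[1]) == (gs, ge):
            # Boundaries match exactly; strict additionally requires the type.
            return "strict" if pred[2] == gt else "exact"
        if pred[0] < ge and gs < pred[1]:
            # Overlapping boundaries, type ignored: partial credit only.
            return "partial"
    return "miss"
```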
We also find that the recall score is consistently higher than the precision score for all models under all evaluation standards, indicating that the models tend to make more false positive predictions than false negative predictions. The macro scores are lower than the strict/exact/partial scores because macro averaging simply averages the performance across entity types, and some small-sample entity types have lower performance due to a lack of training data.

Overall, BioBERT, BiLSTM+CRF produces the best precision and F1-scores for all four evaluation types, whereas BioBERT, Softmax holds the highest recalls. These results suggest that fine-tuning BioBERT lends itself better to NER tasks in the clinical trial domain, which seems intuitive. For the task layer, however, the choice between Softmax and BiLSTM+CRF does not significantly affect the performance.

RE Results

RE evaluation results are shown in Table 4, in which we report micro/macro/weighted precision (P), recall (R), and F1-score (F).

Table 4: RE task results: Precision (P), Recall (R), F1 Score (F).

RE Model           Type      P      R      F
BERT-base uncased  micro    78.10  79.49  78.79
                   macro    76.43  76.22  76.24
                   weighted 78.03  79.49  78.72
BERT-base cased    micro    73.61  75.33  74.46
                   macro    69.56  68.63  68.80
                   weighted 73.41  75.33  74.27
BioBERT            micro    74.37  74.83  74.60
                   macro    70.30  68.34  69.08
                   weighted 74.17  74.83  74.44

From the performance chart above, we find that BERT-base uncased has the highest F1-scores, whereas BERT-base cased has the lowest. Comparing BERT-base cased and BioBERT indicates that BioBERT helps performance slightly, at least in this cased scenario. On the other hand, BERT-base uncased noticeably improves over its cased peer, BERT-base cased, by a 4.33 percentage-point margin. Therefore, just like the NER task, the RE task also favors the uncased setting, probably because lowercasing reduces vocabulary variation in processing. We also observe that recall and precision are close to each other, with precision slightly higher for the macro evaluation and recall slightly higher for micro and weighted. These observations suggest that the model has higher precision than recall in classes with fewer samples, such as 'is located' and 'is negated' (see Table 2); in the macro evaluation, the contribution from these smaller classes becomes more visible.

Overall, the BERT-base uncased model prevails: it outperforms the other two models on every evaluation type and measure. For example, it has an F1-score of 78.79 for micro, compared to BERT-base cased's 74.46 and BioBERT's 74.60. These results indicate again that lowercasing preprocessing helps the NLP tasks, even in the clinical trial domain, where many terms are written in capital letters. Secondly, BioBERT beating BERT-base cased by a small margin may suggest that although pre-training on the biomedical domain brings some benefit, it is still not specific enough for clinical trials. Since no uncased BioBERT pre-trained model is available, it is unclear whether training on a biomedical corpus with lowercasing preprocessing could synergistically improve the performance. Considering the big improvement from BERT-base cased to BERT-base uncased, we believe the uncased scenario for the current BioBERT model is worth future investigation.

Error Analysis

We inspect NER predictions from one of the models (BERT-base uncased, Softmax) in a Brat server, an open-source tool that helps visualize annotation results using color bars (Stenetorp et al. 2012). We overlay the human and predicted annotations together in Brat to facilitate the comparison.

The NER errors can be broadly categorized into boundary errors and entity type errors, as reflected by the four evaluation types. For boundary errors, one pattern is that BERT tends to mis-annotate some words inside a multi-word phrase. For example, as shown in Figure 4, "at least a 3 month" is one temporal constraint entity, but the NER model captures only "at" + "3 month" while missing the words in the middle ("least a"). This reflects a potential problem with BERT NER models: although they can assign entity classes relatively well, the lack of structure enforcement on the output layer may cause inconsistent labels within a full phrase.

Figure 4: An example of the NER engine mis-annotating tokens within a phrase.

In some cases, the NER model captures longer entities than the human annotator. For example, the model annotates "[cardiac mechanical assist device]|Device", whereas the gold standard annotates the same phrase as "[cardiac]|AnatomicLocation" + "[mechanical assist device]|Device". In other cases, the situation reverses and the NER model chunks one entity in the gold standard into multiple ones. For example, "[non-steroidal anti-inflammatory drugs]|Drug" is chunked into a Qualifier/Modifier and a drug: "[non-steroidal]|Qualifier/Modifier [anti-inflammatory drugs]|Drug". The boundary merging and chunking issues illustrated by these two examples occur frequently with the Qualifier/Modifier class, as it is arguable whether a complex term should be annotated as one whole entity or as a Qualifier/Modifier plus an entity.
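The boundary errors above are easiest to see in the IOB tag representation used for our annotation. The rendering below of the Figure 4 example is hypothetical: the tag names are illustrative, not our exact label set.

```python
# Hypothetical IOB rendering of the Figure 4 boundary error discussed above.
tokens = ["at",         "least",      "a",          "3",          "month"]
gold   = ["B-Temporal", "I-Temporal", "I-Temporal", "I-Temporal", "I-Temporal"]
pred   = ["B-Temporal", "O",          "O",          "B-Temporal", "I-Temporal"]
# Under strict/exact matching this prediction gets no credit (the boundary is
# broken); under partial matching it receives overlap credit only.
```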
For entity type errors, we observe a few cases, such as "urinalysis" (Procedure type) being predicted as an Observation entity, and "gastrointestinal motility" (Condition type) being predicted as Drug. According to our manual inspection, type errors occur less frequently than boundary errors.

For the RE task, we manually screen the predictions from the BERT-base uncased, Softmax model against the gold standard. We first observe that NER boundary errors can propagate to the RE task. Note that we use only named entity positions, not types, in the RE task, and therefore only NER boundary errors can affect RE performance. For example, "Transient neurologic deficits", annotated as one Condition entity in the gold standard, is split into "Transient" (Qualifier/Modifier) and "neurologic deficits" (Condition), causing the RE task to predict a 'modified by' relation between the two entities that does not actually exist in the gold standard. Another major category of RE classification error is that a number of actual relations are misclassified as 'no relation', while misclassification between the other classes is much less frequent.

Conclusion and Future Work

In this study, we focus on extracting clinically relevant terms and relations from protocol eligibility criteria by applying pre-trained transformer deep learning NLP models to NER and RE tasks. We experiment with several configurations of the pre-trained BERT models and report our results and findings.

Our results demonstrate the effectiveness of NLP models in processing clinical trial protocols. Despite the fact that the processed texts are unique, with specific clinical and medical terms and logical relations, the BERT and BioBERT models returned acceptable performance. We also find that, in general, BioBERT, which is pre-trained on a biomedical corpus, outperforms BERT, which is pre-trained on a general-domain corpus. This agrees with the general understanding of the importance of domain-specific training for achieving higher model performance on domain-specific tasks.

A surprising finding is that even though the clinical trial domain contains many capitalized terminologies, lowercasing preprocessing improves the performance of both the NER and RE tasks. Our hypothesis is that maintaining less token variation (i.e. lowercasing reduces variation) is more important than maintaining casing for these tasks.

It is also worth noting that there is room to improve the quality of our gold standard. Due to the complex nature of the protocols, which cover many different sub-domains of biomedical and clinical science such as therapeutic areas, even human experts can easily make mistakes or be inconsistent. In fact, we found many cases where the model predictions are in fact correct, although different from the gold standard. To address this annotation quality issue, we employ an iterative annotation pipeline that asks human experts to verify documents pre-annotated by the NLP models. We anticipate that this practice can partly address the issue.

We believe that the model performance can be further improved, and we see several directions to explore. The first approach is to train a biomedical BERT model from scratch using a domain-specific vocabulary. The BERT model handles tokens by splitting them into subwords using a predefined subword vocabulary. For example, 'myocarditis' and 'pericarditis', two heart conditions sharing the same suffix 'carditis', are represented as 'my'+'##oca'+'##rdi'+'##tis' and 'per'+'##ica'+'##rdi'+'##tis' respectively. This tokenization does not represent the suffix in a biomedically meaningful way, due to the lack of biomedical subwords in the vocabulary. We assume that subwords generated from the biomedical domain, reflecting word-root patterns, could further enhance the word representations for BERT models and thus improve downstream task performance. We can train a BERT model from scratch using a biomedical corpus and a biomedical subword vocabulary.

The second strategy is to deploy multi-task co-training: since the NER and RE tasks depend on each other, namely knowing one task's output can facilitate the other's, joint learning on them is expected to improve the performance of both.

Our third strategy for future improvement is to reduce the unnecessary relations currently predicted by the RE model. Our current greedy prediction pipeline enumerates all possible entity pairs, which results in an unnecessarily large testing base set. One way to address this issue is to use dependency parsing information, which can indicate whether two terms have a dependency relation, to prune unnecessary entity pairs.

The information extracted by the NER and RE tasks has great potential for assisting the drug development business, especially study feasibility analysis. The derived information is the basis for a local knowledge graph of the protocols, and for a global graph when merged with external structured information such as drug ontologies. In conclusion, this is our first step towards a greater mission of applying deep learning to business cases in drug development, and subsequent analysis based on the derived graph can further enhance our contribution and insights in this research area.

References

Alsentzer, E.; Murphy, J. R.; Boag, W.; Weng, W.-H.; Jin, D.; Naumann, T.; and McDermott, M. 2019. Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323.

Bach, N., and Badaskar, S. 2007. A review of relation extraction. Literature review for Language and Statistics II 2.

Beltagy, I.; Cohan, A.; and Lo, K. 2019. SciBERT: Pretrained contextualized embeddings for scientific text. arXiv preprint arXiv:1903.10676.

Bikel, D. M.; Miller, S.; Schwartz, R.; and Weischedel, R. 1998. Nymble: a high-performance learning name-finder. arXiv preprint cmp-lg/9803003.

Dai, Z.; Yang, Z.; Yang, Y.; Cohen, W. W.; Carbonell, J.; Le, Q. V.; and Salakhutdinov, R. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Howard, J., and Ruder, S. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

Huang, Z.; Xu, W.; and Yu, K. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

Jurafsky, D. 2000. Speech & Language Processing. Pearson Education India.

Lafferty, J.; McCallum, A.; and Pereira, F. C. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML proceedings.

Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; and Dyer, C. 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.

Leaman, R., and Gonzalez, G. 2008. BANNER: an executable survey of advances in biomedical named entity recognition. In Biocomputing 2008. World Scientific. 652-663.

Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C. H.; and Kang, J. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics.

Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Lyu, C.; Chen, B.; Ren, Y.; and Ji, D. 2017. Long short-term memory RNN for biomedical named entity recognition. BMC Bioinformatics 18(1):462.

Ma, X., and Hovy, E. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354.

McCallum, A.; Freitag, D.; and Pereira, F. C. 2000. Maximum entropy Markov models for information extraction and segmentation. In ICML, volume 17, 591-598.

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1(8).

Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints.

Ramshaw, L. A., and Marcus, M. P. 1999. Text chunking using transformation-based learning. In Natural Language Processing Using Very Large Corpora. Springer. 157-176.

Segura-Bedmar, I.; Martínez, P.; and Herrero-Zazo, M. 2013. SemEval-2013 Task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). Association for Computational Linguistics.

Stenetorp, P.; Pyysalo, S.; Topić, G.; Ohta, T.; Ananiadou, S.; and Tsujii, J. 2012. brat: a web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, 102-107. Avignon, France: Association for Computational Linguistics.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008.

Wei, Q.; Chen, T.; Xu, R.; He, Y.; and Gui, L. 2016. Disease named entity recognition by combining conditional random fields and bidirectional recurrent neural networks. Database 2016.

Yang, Z.; Salakhutdinov, R.; and Cohen, W. 2016. Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270.

Yuan, C.; Ryan, P. B.; Ta, C.; Guo, Y.; Li, Z.; Hardin, J.; Makadia, R.; Jin, P.; Shang, N.; Kang, T.; et al. 2019. Criteria2Query: a natural language interface to clinical databases for cohort definition. Journal of the American Medical Informatics Association 26(4):294-305.