Using Pre-trained Transformer Deep Learning Models to Identify Named Entities
              and Syntactic Relations for Clinical Protocol Analysis

                             Miao Chen,¹ Fang Du,² Ganhui Lan,² Victor Lobanov²
          ¹Covance, 8211 SciCor Drive, Indianapolis, IN, USA
          ²Covance, 206 Carnegie Center, Princeton, NJ, USA
                                {miao.chen, fang.du, ganhui.lan, victor.lobanov}@covance.com



                        Abstract

Transformer deep learning models, such as BERT, have demonstrated their effectiveness over previous baselines on a broad range of general-domain natural language processing (NLP) tasks such as classification, named entity recognition, and question answering (Devlin et al. 2018). They also exhibit enhanced performance on domain-specific NLP tasks, including BioNLP tasks (Lee et al. 2019; Alsentzer et al. 2019). In this study, we focus on clinical trial protocols: exploring and extracting key terms (a named entity recognition task) as well as their relations (a relation extraction task) from the protocols using pre-trained transformer deep learning models. We compare several model configurations and report their results. Our NLP model achieves good performance considering the complex and unique nature of the language in real-world protocols, and has been integrated into the organization's protocol analytics practice. This approach and the extracted information will greatly facilitate trial feasibility analysis for developing new drugs.

                      Introduction

Clinical trial protocols (often called "study protocols") contain key information specifying the trial design and implementation, but are usually in unstructured or semi-structured format, which presents a huge challenge for running computational analysis on them. Due to protocols' critical role, drug development businesses, such as contract research organizations, have been devoting significant amounts of resources to analyzing study protocols in order to precisely understand the operational requirements, comprehensively evaluate the systemic challenges, unbiasedly assess the probability of success, and accurately forecast the cost implications for optimal business planning. Currently, this protocol analysis work is still performed in a labor-intensive fashion, involving numerous resource-checking and cross-referencing steps. To develop safer, cheaper and more effective drugs faster for better public health, there is an urgent need for more efficient and effective ways to process text-based protocols.

Here, we present our efforts to facilitate the protocol analysis workflow by automating the extraction of key information from the protocols using natural language processing (NLP) techniques. More specifically, we focus on the eligibility criteria section of the protocols, which contains patient selection criteria; we extract key clinically relevant entities (i.e. named entities) and entity relations (i.e. syntactic relations) from this section. Based on the extracted information, the unstructured protocols can be transformed into a structured network of interconnected key entities (e.g. condition, drug, observation, etc.) that can be fed into various data-based analytic tasks, for example querying real-world evidence databases for patient population estimation, which is critical for clinical trial design in drug development.

Covance Inc. is the world's largest provider of clinical trial design, monitoring, management and central lab testing services, and has accumulated a large volume of study protocols. The presented work is our first step in a larger mission towards solving the protocol analysis challenge. To this end, we employed a transfer learning strategy and experimented with the deep learning family of algorithms, using the recently developed Bidirectional Encoder Representations from Transformers (BERT) based models and fine-tuning them on our in-house clinical trial protocol corpus to identify named entities and their relations.

Study protocols are rigorous scientific documents with highly domain-specific terms and complex relations. These characteristics bring both benefits and challenges to NLP work: we are less concerned about preprocessing, owing to protocols' rigorous use of language, but need to attend more to their unique yet complex clinical terms and relations. A study protocol's eligibility criteria section is usually composed of two parts, inclusion criteria and exclusion criteria, which respectively describe the unambiguous characteristics of patients to be included in and excluded from the clinical trial. The general public can access some simplified protocol texts via websites such as ClinicalTrials.gov, which already contain many clinical terminologies. However, real protocols are much longer, with even more domain-specific terms, and are thus more difficult for the NLP task. We employ pre-trained BERT transformers to tackle this challenging NLP task, and our study provides quantified evidence of how BERT performs in the clinical trial domain.

Copyright © 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
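As a toy illustration of the goal stated above — reducing an eligibility criterion to key-value clauses that can be checked against patient records — here is a minimal sketch. All field names, thresholds, and patient records are hypothetical simplifications; real protocol criteria are far richer.

```python
# Sketch: an eligibility criterion as key-value clauses, filtered
# against a (hypothetical) patient database.

criterion = {
    "condition": "type 2 diabetes",   # Condition entity
    "hba1c_max": 9.0,                 # Measurement + has-value relation
    "age_min": 18,                    # Demographics entity
}

patients = [
    {"id": "P1", "condition": "type 2 diabetes", "hba1c": 7.2, "age": 54},
    {"id": "P2", "condition": "type 2 diabetes", "hba1c": 9.8, "age": 61},
    {"id": "P3", "condition": "hypertension",    "hba1c": 6.1, "age": 47},
]

def satisfies(patient, crit):
    """True if a patient record satisfies every clause of the criterion."""
    return (patient["condition"] == crit["condition"]
            and patient["hba1c"] <= crit["hba1c_max"]
            and patient["age"] >= crit["age_min"])

eligible = [p["id"] for p in patients if satisfies(p, criterion)]
# eligible == ["P1"]: P2 exceeds the HbA1c ceiling, P3 lacks the condition.
```

In practice such clauses would be produced by the NER/RE pipeline rather than written by hand, and the query would run against a real-world evidence database rather than an in-memory list.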
Figure 1: Structured information extracted from protocol eligibility criteria.

In our practice, the extracted information is stored in a structured format. Figure 1 shows an example: the inclusion criteria are represented as several key-value clauses such that we can query a patient database to find the patients satisfying these criteria. Through extraction we are essentially connecting dots to build a larger graph for knowledge engineering purposes, i.e. we connect protocol text to patient database records, connect protocols to condition terms in a medical ontology, and so on. Once the dots are properly connected, we are empowered to perform many protocol analysis tasks, such as building a search engine for precise search, composing graph networks for graph analysis to capture missing links, evaluating drug effectiveness by comparison with similar drugs, and clustering and recommending similar protocols for study feasibility analysis.

                      Related Work

Named entity recognition (NER) and relation extraction (RE) are two classical natural language processing (NLP) tasks, which we carry out to extract entities and syntactic relations, respectively, in our study. Previously, for NER, researchers mainly investigated probabilistic sequence labeling models such as conditional random fields (CRF), maximum entropy Markov models, and hidden Markov models (Lafferty, McCallum, and Pereira 2001; McCallum, Freitag, and Pereira 2000; Bikel et al. 1998). For RE, text classification methods, such as support vector machines, logistic regression, and perceptrons, along with feature engineering, have been used to assign relations between entities (Bach and Badaskar 2007; Jurafsky 2000).

In recent years, with the advances in deep neural network methods, significant performance improvements have been achieved on the NER and RE tasks. For NER, embeddings are widely used in neural network models to represent words or characters as high-dimensional vectors. Recurrent neural networks (RNN), including LSTM, GRU, and their variants, are applied because their architectures better represent sentence context as well as the dynamic sentence length of natural languages (Huang, Xu, and Yu 2015; Yang, Salakhutdinov, and Cohen 2016). The Bidirectional LSTM (Bi-LSTM) plus CRF network architecture has also been widely used to achieve better NER performance (Ma and Hovy 2016; Lample et al. 2016).

Despite the improvement over earlier models, RNN and LSTM models tend to "forget" earlier context in long sequences, which limits model performance. Transformers were subsequently proposed to counter this issue. Transformer models use the attention mechanism, which attends to each word in a sequence by replacing the sequence-based RNN-style network structure with dot products and multiplications between the key/value/query matrices projected from the embedding vectors (Vaswani et al. 2017). Transformers have the advantage of attending to every token in a sequence, whether long or short, and therefore they can capture associations between tokens even when these are distantly separated from each other. BERT (Bidirectional Encoder Representations from Transformers), a recently popular NLP deep learning model, employs multiple layers of attention and significantly improved NLP task performance over previous models (Devlin et al. 2018).

Additionally, transfer learning aims to transfer a pre-trained model from one task to another, usually by training a general language model on a general-domain data set and transferring it to a downstream task by fine-tuning on the task-specific data set. A number of pre-trained language models have been created to facilitate downstream tasks such as NER and RE; examples include ELMo, ULMFiT, OpenAI GPT, and BERT, which have outperformed previous baselines, with some even achieving state-of-the-art performance (Peters et al. 2018; Howard and Ruder 2018; Radford et al. 2019).

Based on the original BERT architecture, a number of BERT variants have emerged with alterations for different purposes. For example, RoBERTa removes next sentence prediction from the original loss function along with some other hyperparameter changes; Transformer-XL captures context both within and between segments to tackle long-term dependencies across sentences; and T5 advocates an encoder-decoder architecture, denoising objectives, and other changes based on extensive experiments (Liu et al. 2019; Dai et al. 2019; Raffel et al. 2019).

NER and RE have also been longstanding tasks in the biomedical NLP domain. Researchers have investigated applying similar yet more customized approaches to biomedical texts, such as CRF models and BiLSTM+CRF neural networks (Leaman and Gonzalez 2008; Lyu et al. 2017; Wei et al. 2016). With the introduction of the BERT model, BERT-based models have been adapted to the biomedical domain by retraining on biomedical corpora; among the examples are BioBERT, SciBERT, and Clinical BERT (Lee et al. 2019; Beltagy, Cohan, and Lo 2019; Alsentzer et al. 2019).

In the clinical informatics field, it is important to convert unstructured criteria text into a structured format, because this enables automatic parsing of criteria and querying for matching patients against real-world evidence databases. Therefore, NER and RE algorithms are an appropriate and natural fit for this practice: NER extracts concepts such as conditions and observations that are related to a patient; RE provides operational information such as the range of a particular lab test result for patient selection. Criteria2Query is a pioneering work in the space of translating study criteria to SQL queries (Yuan et al. 2019). It relies mainly on CRF sequence labeling for the NER task and SVM classification for relation extraction.
To the best of our knowledge, there has been no published research or practice using pre-trained transformer deep learning methods to extract structured information from unstructured clinical trial protocols. Motivated by the excellent performance of BERT-based models on NER and RE tasks in general domains, we experiment with and develop models, and evaluate their performance in the clinical trial domain.

                      Methodology

Data Set

To facilitate our NLP approach, we selected 470 study protocols from Covance's in-house protocol database. Our protocol corpus comprises the eligibility criteria sections of these selected study protocols. An eligibility criteria section typically contains 5-20 sentences that define the criteria to select and recruit patients for the clinical study. Our data contain a total of 30,183 criteria sentences.

Table 1: Train and test data counts for the NER task.

    Entity                     Train     Test
    Condition                  12,682    8,537
    Observation                7,309     5,218
    Procedure                  3,406     2,234
    Device                     221       140
    Drug                       7,793     5,858
    Investigational product    329       224
    Event                      2,430     1,625
    Refractory condition       381       278
    Demographics               498       381
    Measurement                4,540     3,344
    Temporal constraints       6,968     4,589
    Qualifier/modifier         7,853     5,196
    Anatomic location          427       223
    Negation cue               921       615
    Permission cue             1,236     869
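The IOB format referenced in the Data Annotation paragraph tags each token as the Beginning of, Inside of, or Outside any entity span. As a minimal sketch of how annotated spans map to per-token tags — the sentence, spans, and entity types here are hypothetical, not drawn from the corpus:

```python
def spans_to_iob(tokens, spans):
    """Convert token-index entity spans [(start, end, type), ...]
    (end exclusive) into one IOB tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype          # continuation tokens
    return tags

# Hypothetical criteria sentence and annotations.
tokens = ["History", "of", "myocardial", "infarction", "within", "6", "months"]
spans = [(2, 4, "Condition"), (4, 7, "Temporal")]
tags = spans_to_iob(tokens, spans)
# tags: O O B-Condition I-Condition B-Temporal I-Temporal I-Temporal
```

This one-tag-per-token view is exactly the shape a token-classification layer (softmax or CRF) predicts over.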
Data Annotation. We have the eligibility criteria annotated using the IOB format (Ramshaw and Marcus 1999). The corpus is annotated by well-trained biomedical domain experts and serves as the gold standard for training and testing. They manually annotate the key clinical entities and their pairwise relations, if any exist. We focus on 15 types of entities and 7 types of relations that help clinically define a patient cohort:

Entities: Condition, Observation, Procedure, Device, Drug, Investigational product, Event, Refractory condition, Demographics, Measurement, Temporal constraints, Qualifier/modifier, Anatomic location, Negation cue, Permission cue

Syntactic relations: Has value, Has temporal constraint, Modified by, Located in, Is negated, Is permitted, Specified by

Data Split. For the NER task, we randomly split the 30,183 sentences into training (60%, 18,109 sentences) and test (40%, 12,074 sentences) sets. For the RE task, before splitting the data for training and testing, we first check whether a sentence contains multiple relations; if so, we duplicate the sentence for each pair of related entities and use their relation type as the label for classification. This results in 52,470 relation sample sentences, based on which we perform a random split with stratification on relation classes to derive the training (60%, 31,482 relation samples) and test (40%, 20,988 relation samples) sets. Tables 1 and 2 show the data statistics for the NER and RE tasks.

Table 2: Train and test data counts for the RE task.

    Relation                   Train     Test
    is negated                 703       468
    is permitted               1,009     673
    modified by                5,715     3,810
    has value                  3,326     2,218
    has temporal constraint    6,169     4,112
    is located                 215       143
    specified by               3,729     2,486
    no relation                10,616    7,078
    total count                31,482    20,988

NER Task

As previously mentioned, we use NER algorithms to extract clinically relevant entities from the eligibility criteria section, and we particularly choose BERT, a pre-trained transformer type of deep learning model, because of its reported superior performance on many NLP tasks. Due to the attention transformer in BERT, it is able to provide dynamic context embeddings for tokens, which helps address the polysemy issue. BERT is a language model pre-trained on a large general-domain corpus and can be applied to downstream tasks by adding simply structured task layers and fine-tuning on a task-specific data set. We hereby follow the fine-tuning practice based on pre-trained models to derive our NER model (Devlin et al. 2018; Lee et al. 2019). We explore several options with regard to the choice of pre-trained models and task layers.

NER task layers. The original BERT paper indicates that, when used for NER tasks, the pre-trained BERT model can simply be followed by a softmax layer in which each token is classified into its most likely entity class, without adding any CRF layer (Devlin et al. 2018). However, our experiments suggest that this approach sometimes fails to recognize contiguous phrases as whole entities. To address this issue, we further experiment with BiLSTM+CRF layers as the NER task layer, for their potentially better ability to capture bi-directional context as well as tagging likelihood at the sentence level (as opposed to the token level).

Cased or uncased. The BERT model provided by Google includes versions with and without lowercasing preprocessing of the tokens. We experiment with both the cased (no lowercasing) and uncased (lowercasing applied) options. Consequently, the two options use different subword vocabularies, with the cased model having 28,996 subwords and the uncased model 30,522 subwords.

Pre-trained models. We use BERT-base, a smaller version of BERT comprising 110 million parameters, in our first set of experiments.
BERT also has a larger version, BERT-large, with 340 million parameters; we opt for BERT-base for exploration purposes. In our second set of experiments, we test the BioBERT model, which is retrained on large-scale biomedical texts on the basis of the original BERT model. BioBERT has only a cased version and shares the same vocabulary as BERT-base cased (with a size of 28,996).

Hyperparameters. For both the BERT-base and BioBERT models, we set num_of_epochs=20, learning_rate=2e-5, training_batch_size=32, and max_sequence_length=32. When using BiLSTM+CRF as the task layers, we set bilstm_layer_size=128.

The above model options result in 6 NER models:

• BERT-base uncased, Softmax: BERT-base uncased pre-trained model, softmax as the NER task layer
• BERT-base cased, Softmax: BERT-base cased pre-trained model, softmax as the NER task layer
• BioBERT, Softmax: BioBERT pre-trained model (cased), softmax as the NER task layer
• BERT-base uncased, BiLSTM+CRF: BERT-base uncased pre-trained model, BiLSTM+CRF as the NER task layer
• BERT-base cased, BiLSTM+CRF: BERT-base cased pre-trained model, BiLSTM+CRF as the NER task layer
• BioBERT, BiLSTM+CRF: BioBERT pre-trained model (cased), BiLSTM+CRF as the NER task layer

The layout of the BERT NER neural architecture is shown in Figure 2.

Figure 2: Neural architecture of the BERT NER task (with Softmax as the task layer).

RE Task

The RE task is also treated as a downstream task for the pre-trained models. The original BERT paper did not include RE as one of its downstream tasks, whereas the BioBERT study investigated it due to its importance in the biomedical NLP domain (Lee et al. 2019). BioBERT handles relation extraction as a classification task at the sentence or sequence level. In particular, it assumes that each sentence contains at most one relation and classifies whether a whole sentence, instead of a particular pair of entities, contains a relation of interest, e.g. a gene-disease relation. This approach is not directly applicable to our data for two reasons: 1) our data contain multiple types of relations, and 2) in our data set, one sentence often contains multiple relations (52,470 relations / 30,183 sentences = 1.7 relations per sentence on average).

We employ the following strategy for the RE task. In training, we first scan through each sentence for entities using the human annotations and record the token positions of each entity; if a sentence contains n (n > 1) pairs of entities with human-annotated relations, we duplicate the sentence n times so that each instance represents one pair of entities and their relation. In prediction, we use the NER pipeline results to locate entities, enumerate all legitimate entity pairs, and duplicate sentences accordingly. Since we record the token positions of each entity pair, we can obtain the BERT output vectors for the pair based on their position information, concatenate the two vectors, and then feed the result to a softmax layer to classify their relation. The result can be one of the 7 relations listed in Table 2 or 'no relation'.

More specifically, the input fed to the BERT RE model is the sentence text along with the positions of the entity pair. We do not make use of entity type information, for the following reasons: 1) this end-to-end (i.e. tokens-to-relation) practice makes the RE model more useful as a standalone tool that does not require entity types; 2) in prediction mode, errors in entity prediction could propagate to the RE task, which we mitigate by including only the entity position information. Figure 3 shows the neural architecture of our RE task.

Figure 3: Neural architecture of the BERT RE task (with Softmax as the task layer).
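The duplication strategy described for the RE task — one classification instance per candidate entity pair, carrying only the sentence text and the pair's token positions — can be sketched as follows. The sentence, token offsets, and dictionary keys are hypothetical illustrations, not the actual pipeline's data format.

```python
from itertools import combinations

def make_re_instances(sentence, entities):
    """One instance per entity pair: the duplicated sentence text plus
    the token positions of the two entities (no entity types, mirroring
    the position-only input described above)."""
    instances = []
    for e1, e2 in combinations(entities, 2):
        instances.append({"text": sentence,
                          "e1_pos": e1["pos"],
                          "e2_pos": e2["pos"]})
    return instances

# Hypothetical criteria sentence with three located entities.
sent = "No aspirin within 14 days before screening"
ents = [{"pos": (1, 2)},   # "aspirin"
        {"pos": (2, 5)},   # "within 14 days"
        {"pos": (0, 1)}]   # "No"
inst = make_re_instances(sent, ents)
# 3 entities -> 3 pairs -> 3 duplicated instances
```

At prediction time the relation classifier (softmax over the concatenated position-indexed output vectors) would then label each instance with one of the 7 relations or 'no relation'.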
For training purposes, we randomly generate negative samples for the 'no relation' class, since two entities can have no relation to each other. We obtain negative samples in two ways: one is to randomly choose two unrelated entities in a sentence; the other is to break an existing related entity pair and form a non-related pair between one of the entities in the original pair and another unrelated entity in the sentence.

Similar to the NER task, we experiment with 3 pre-trained models, with softmax as the task layer for all of them:

• BERT-base uncased: BERT-base pre-trained model, uncased
• BERT-base cased: BERT-base pre-trained model, cased
• BioBERT: BioBERT pre-trained model (cased)

The following hyperparameter configuration is used: num_of_epochs=20, learning_rate=2e-5, training_batch_size=32, max_sequence_length=32.

                   Results and Analysis

We implement the NER and RE tasks in TensorFlow, based on the BERT neural architecture, and run experiments on an AWS p2.xlarge GPU instance.

Table 3: NER task results: Precision (P), Recall (R), F1 score (F).

    NER Model                        Type     P      R      F
    BERT-base uncased, Softmax       strict   67.76  71.98  69.80
                                     exact    71.02  75.44  73.16
                                     partial  75.28  79.96  77.55
                                     macro    62.65  66.83  64.63
    BERT-base cased, Softmax         strict   67.82  71.66  69.68
                                     exact    71.19  75.22  73.15
                                     partial  75.41  79.68  77.49
                                     macro    63.04  66.37  64.63
    BioBERT, Softmax                 strict   68.73  72.60  70.61
                                     exact    71.87  75.91  73.83
                                     partial  75.99  80.26  78.06
                                     macro    62.97  67.27  65.03
    BERT-base uncased, BiLSTM+CRF    strict   68.59  72.06  70.28
                                     exact    71.85  75.49  73.62
                                     partial  76.10  79.95  77.98
                                     macro    63.43  66.45  64.88
    BERT-base cased, BiLSTM+CRF      strict   68.09  71.80  69.89
                                     exact    71.34  75.22  73.23
                                     partial  75.55  79.67  77.56
                                     macro    62.68  66.41  64.45
    BioBERT, BiLSTM+CRF              strict   69.12  72.47  70.76
                                     exact    72.35  75.85  74.06
                                     partial  76.55  80.25  78.36
                                     macro    63.79  67.44  65.54
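The two ways of generating 'no relation' samples described under RE Task both reduce, in effect, to drawing an entity pair that carries no annotated relation. A simplified, seeded sketch of that reduction (entity names and the single positive pair are hypothetical):

```python
import random

def sample_negative(entities, positive_pairs, rng):
    """Pick an entity pair with no annotated relation: either two
    unrelated entities, or a 'broken' positive pair re-paired with an
    unrelated entity -- both cases yield a non-positive pair."""
    related = {frozenset(p) for p in positive_pairs}
    candidates = [(a, b) for i, a in enumerate(entities)
                  for b in entities[i + 1:]
                  if frozenset((a, b)) not in related]
    return rng.choice(candidates) if candidates else None

rng = random.Random(7)                 # seeded for reproducibility
entities = ["e1", "e2", "e3", "e4"]    # entities found in one sentence
positives = [("e1", "e2")]             # the annotated relation
neg = sample_negative(entities, positives, rng)
```

The resulting pair is labeled 'no relation' and mixed into the training set alongside the annotated relation instances.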
NER Results

We follow the practice of the SemEval-2013 Drug-Drug Interactions task and evaluate NER performance by 3 matching standards: strict, exact, and partial (Segura-Bedmar, Martínez, and Herrero-Zazo 2013). Strict matching evaluates both the boundary and the entity type of entity phrases; exact matching evaluates the exact boundary regardless of entity type; and partial matching measures the partial boundary of entities regardless of entity type (and is thus the most lenient). We calculate precision (P), recall (R), and f1-score (F) for the three evaluation types, and additionally we report macro-average P/R/F results. The results are shown in Table 3.

In our experiments, fine-tuning the pre-trained BioBERT model achieves slightly better performance than its BERT counterparts. For example, BioBERT, Softmax has an f1-score of 70.61, better than BERT-base uncased, Softmax's 69.80 and BERT-base cased, Softmax's 69.68. Similarly, BioBERT, BiLSTM+CRF holds a higher f1-score than BERT-base uncased, BiLSTM+CRF and BERT-base cased, BiLSTM+CRF for all four evaluation types.

When comparing the cased and uncased strategies, we notice that the uncased pre-trained models outperform the cased ones with the same neural architecture: e.g. BERT-base uncased, BiLSTM+CRF achieves an f1-score of 70.28 for the strict evaluation type, higher than the f1-score of 69.89 from BERT-base cased, BiLSTM+CRF. This finding suggests that applying lowercasing in preprocessing actually enhances performance slightly, which is counter-intuitive for NER tasks, as the entities are often case-sensitive. Meanwhile, we also find that the two BioBERT models, which are cased, perform better than their peer models with the same neural architecture. But since BioBERT only offers the cased option, we cannot discern the relative contribution of being cased in the BioBERT pre-trained model.

From Table 3, it is not surprising that for a given model the partial evaluation usually holds the highest score, followed by exact, strict, and macro. Another observation is that when we loosen the evaluation type from strict to exact, i.e. focusing on the entity boundary without penalizing entity type errors, the performance improves but still remains in the 73.15-74.06 range, suggesting that the experimented BERT-based models fail to identify entity boundaries very precisely, which can be of interest for future investigation.

In our experiments with simple Softmax as the task layer, we observe more boundary detection errors. This is in fact the motivation for us to add the BiLSTM+CRF layers as the NER task layer. However, the results show that, given the same pre-trained model configuration, it is debatable whether BiLSTM+CRF consistently improves performance. For example, BioBERT, BiLSTM+CRF slightly outperforms BioBERT, Softmax in strict matching precision and f1-score, but BioBERT, Softmax beats BioBERT, BiLSTM+CRF in strict matching recall.

We also find that the recall score is consistently higher than the precision score for all models at all evaluation standards, indicating that the models tend to make more false positive predictions than false negative predictions. The macro scores show lower performance than strict/exact/partial because the macro average simply averages the performance of the different entity types, and some small-sample entity types perform worse due to a lack of training data.

Overall, BioBERT, BiLSTM+CRF produces the best precision and f1-scores for all four evaluation types, whereas BioBERT, Softmax holds the highest recalls. These results suggest that fine-tuning BioBERT lends itself better to NER tasks in the clinical trial domain, which seems intuitive. But for the task layer, the choice between Softmax and BiLSTM+CRF does not significantly affect the performance.

Firstly, the uncased models outperforming the cased ones is counter-intuitive, especially in the biomedical domain where many terms are represented in capital letters. Secondly, BioBERT beating BERT-base cased by a small margin may suggest that although pre-training on a biomedical domain corpus can bring some benefit, it is still not specific enough for clinical trials. Since no uncased BioBERT pre-trained model is available, it is unclear whether training on a biomedical corpus with lowercasing preprocessing could synergistically improve the performance. Considering the improvement from BERT-base cased to BERT-base uncased, we believe the uncased scenario for the current BioBERT model is worth future investigation.

RE Results

Error Analysis

We present and inspect NER prediction results from one of the models (BERT-base uncased, Softmax) in a Brat server, an open source tool that can help visualize annotation results
RE evaluation results are shown in Table 4, in which we re-      using color bars (Stenetorp et al. 2012). We overlay human
port micro/macro/weighted precision(P), recall(R), and f1-       and prediction annotations together in Brat to facilitate the
score(F).                                                        comparison.
                                                                    The NER errors can be broadly categorized into bound-
                                                                 ary errors and entity type errors, as reflected by the four
Table 4: RE task results: Precision(P), Recall(R), F1            evaluation types. For boundary errors, one pattern is that
Score(F).                                                        BERT tends to mis-annotate some words inside a multi-word
 RE Model          Type        P        R       F                phrase. For example, as shown in Figure 4, “at least a 3
                   micro       78.10 79.49 78.79                 month” is one temporal constraint entity, but the NER model
 BERTbase,uncased macro        76.43 76.22 76.24                 only captures “at”+“3 month” while misses the words in the
                   weighted 78.03 79.49 78.72                    middle (“least a”). This reflects a potential problem with
                   micro       73.61 75.33 74.46                 BERT NER models: although it can assign entity classes
 BERTbase,cased    macro       69.56 68.63 68.80                 relatively well, lack of structure enforcement on its output
                   weighted 73.41 75.33 74.27                    layer may possibly cause the inconsistent label within a full
                   micro       74.37 74.83 74.60                 phrase.
 BioBERT           macro       70.30 68.34 69.08
                   weighted 74.17 74.83 74.44

   From the above performance chart, we find
that BERTbase,uncased has the highest f1-scores,
whereas BERTbase,cased has the lowest. Comparing
BERTbase,cased and BioBERT indicates that BioBERT                Figure 4: An example of the NER engine mis-annotating to-
can help with performance slightly, at least for this cased      kens within a phrase.
scenario. On the other hand, BERTbase,uncased noticeably
improves over its cased peer, BERTbase,cased , by a 4.33            In some cases, the NER model captures longer enti-
percentage margin. Therefore, just like the NER task, the        ties than the human annotator. For example, the model
RE task is also case insensitive, probably because uncased       annotates “[cardiac mechanical assist device]|Device”;
situations reduce vocabulary variations in processing. We        whereas the gold standard annotates the same phrase as
also observe that recall and precision are close to each other   “[cardiac]|AnatomicLocation” + “[mechanical assist de-
with precision slightly higher for the macro evaluation, but     vice]|Device”. In some other cases, the situation reverses
on the contrary, precision is slightly higher than recall for    and the NER model chunks one entity in the gold stan-
micro and weighted. These observations suggest that the          dard into multiple ones. For example, “[non-steroidal
model has higher precision score than recall score in classes    anti-inflammatory drugs]|Drug” is chunked into a Quali-
with less samples, such as ‘is located’ and ‘is negated’         fier/Modifier and a drug: “[non-steroidal]|Qualifier/Modifier
(in Table 1). And when doing macro evaluation, the               [anti-inflammatory drugs]|Drug”. The boundary merging
contribution from the smaller classes becomes more visible.      and chunking issues, as illustrated by these two examples,
   Overall, the BERTbase,uncased model prevails - it out-        occur frequently with the Qualifier/Modifier class as it is ar-
performs the other two models on each evaluation type and        guable that a complex term can be annotated by one whole
measures. For example, it has f1-score of 78.79 for mi-          entity or as a Qualifier/Modifier plus an entity.
cro, compared to BERTbase,cased ’s 74.46 and BioBERT ’s             For the entity type error, we observe a few cases, such
74.60. These results indicate again that the lowercasing pre-    as “urinalysis—Procedure type” is predicted as an Observa-
processing helps the NLP tasks even in the clinical trial        tion entity, and “gastrointestinal motility—Condition type”
is predicted as Drug. The type errors occur less frequently than boundary errors according to our manual inspection.

For the RE task, we manually screen the predictions from the BERTbase,uncased, Softmax model against the gold standards. We first observe that the NER boundary errors can propagate to the RE task. Note that we only use named entity positions, not types, in the RE task, and therefore only NER boundary errors can affect the RE performance. For example, “Transient neurologic deficits”, annotated as one Condition entity in the gold standard, is split into “Transient—Qualifier/Modifier” and “neurologic deficits—Condition”, thus causing the RE task to predict a ‘modified by’ relation between the two entities which does not actually exist in the gold standard. Another major category of RE classification error is that a number of actual relations are misclassified as ‘no relation’, while misclassification between other classes is much less frequent.

Conclusion and Future Work
In this study, we focus on extracting clinically relevant terms and relations from protocol eligibility criteria by applying pre-trained transformer deep learning NLP models for NER and RE tasks. We experiment with several configurations of the pre-trained BERT models and report our results and findings.

Our results demonstrate the effectiveness of NLP models in processing clinical trial protocols. Despite the fact that the processed texts are unique, with specific clinical and medical terms and logical relations, the BERT and BioBERT models returned acceptable performance. We also find that, in general, BioBERT, which is pre-trained on a biomedical corpus, outperforms BERT, which is pre-trained on a general-domain corpus. This agrees with the general understanding of the importance of domain-specific training for achieving higher model performance in domain-specific tasks.

A surprising finding is that even though the clinical trial domain largely contains capitalized terminologies, lowercasing preprocessing improves the performance of both the NER and RE tasks. Our hypothesis is that maintaining less token variation (i.e. lowercasing has less variation) is more important than maintaining casing for these tasks.

It is also worth noting that there is room to improve the quality of our gold standard. Due to the complex nature of the protocols, which cover many different sub-domains in biomedical and clinical sciences such as therapeutic areas, even human experts can easily make mistakes or be inconsistent. In fact, we found many cases in which the model predictions are in fact correct, although different from the gold standard. To address this annotation quality issue, we employed an iterative annotating pipeline that asks human experts to verify the documents pre-annotated by the NLP models. We anticipate that this practice can help partly address the issue.

We believe that the model performance can be further improved, and we can explore several directions to do so. The first approach is to train a biomedical BERT model from scratch using a domain-specific vocabulary. The BERT model handles tokens by splitting them into subwords using a predefined subword vocabulary. For example, ‘myocarditis’ and ‘pericarditis’, two heart conditions sharing the same suffix ‘carditis’, are, however, represented as ‘my’+‘##oca’+‘##rdi’+‘##tis’ and ‘per’+‘##ica’+‘##rdi’+‘##tis’ respectively. This way of tokenization does not represent the suffix in a biomedically meaningful way due to the lack of biomedical subwords in the vocabulary. We assume that subwords generated from the biomedical domain, reflecting word root patterns, can further enhance the word representation for BERT models and thus improve downstream task performance. We can train a BERT model from scratch using a biomedical corpus and a biomedical subword vocabulary.

The second strategy is to deploy multi-task co-training: since the NER and RE tasks are dependent on each other, namely, knowing one task's output can facilitate the other's, joint learning on them is expected to improve performance on both.

Our third strategy for future improvement is to reduce the unnecessary relations currently predicted by the RE model. Our current greedy prediction pipeline enumerates all possible entity pairs, which results in an unnecessarily large testing base set. One way to address this issue is to consider dependency parsing information, which can indicate whether two terms have a dependency relation, to prune unnecessary entity pairs.

The extracted information from the NER and RE tasks has great potential to assist the drug development business, especially study feasibility analysis. The derived information is the basis for a local knowledge graph for the protocols and for a global graph when merged with external structured information such as drug ontologies. In conclusion, this is our first step towards a greater mission to apply deep learning to business cases in drug development, and the subsequent analysis based on the derived graph can further enhance our contributions and insights in this research area.

References
Alsentzer, E.; Murphy, J. R.; Boag, W.; Weng, W.-H.; Jin, D.; Naumann, T.; and McDermott, M. 2019. Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323.
Bach, N., and Badaskar, S. 2007. A review of relation extraction. Literature review for Language and Statistics II 2.
Beltagy, I.; Cohan, A.; and Lo, K. 2019. SciBERT: Pretrained contextualized embeddings for scientific text. arXiv preprint arXiv:1903.10676.
Bikel, D. M.; Miller, S.; Schwartz, R.; and Weischedel, R. 1998. Nymble: a high-performance learning name-finder. arXiv preprint cmp-lg/9803003.
Dai, Z.; Yang, Z.; Yang, Y.; Cohen, W. W.; Carbonell, J.; Le, Q. V.; and Salakhutdinov, R. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Howard, J., and Ruder, S. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
Huang, Z.; Xu, W.; and Yu, K. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
Jurafsky, D. 2000. Speech & Language Processing. Pearson Education India.
Lafferty, J.; McCallum, A.; and Pereira, F. C. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML proceedings.
Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; and Dyer, C. 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.
Leaman, R., and Gonzalez, G. 2008. BANNER: an executable survey of advances in biomedical named entity recognition. In Biocomputing 2008. World Scientific. 652–663.
Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C. H.; and Kang, J. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Lyu, C.; Chen, B.; Ren, Y.; and Ji, D. 2017. Long short-term memory RNN for biomedical named entity recognition. BMC Bioinformatics 18(1):462.
Ma, X., and Hovy, E. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354.
McCallum, A.; Freitag, D.; and Pereira, F. C. 2000. Maximum entropy Markov models for information extraction and segmentation. In ICML, volume 17, 591–598.
Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1(8).
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints.
Ramshaw, L. A., and Marcus, M. P. 1999. Text chunking using transformation-based learning. In Natural Language Processing Using Very Large Corpora. Springer. 157–176.
Segura-Bedmar, I.; Martínez, P.; and Herrero-Zazo, M. 2013. SemEval-2013 Task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). Association for Computational Linguistics.
Stenetorp, P.; Pyysalo, S.; Topić, G.; Ohta, T.; Ananiadou, S.; and Tsujii, J. 2012. brat: a web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, 102–107. Avignon, France: Association for Computational Linguistics.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
Wei, Q.; Chen, T.; Xu, R.; He, Y.; and Gui, L. 2016. Disease named entity recognition by combining conditional random fields and bidirectional recurrent neural networks. Database 2016.
Yang, Z.; Salakhutdinov, R.; and Cohen, W. 2016. Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270.
Yuan, C.; Ryan, P. B.; Ta, C.; Guo, Y.; Li, Z.; Hardin, J.; Makadia, R.; Jin, P.; Shang, N.; Kang, T.; et al. 2019. Criteria2Query: a natural language interface to clinical databases for cohort definition. Journal of the American Medical Informatics Association 26(4):294–305.
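
As an illustrative aside to the subword discussion in the future-work section: WordPiece splits each word by greedy longest-match-first lookup against a fixed subword vocabulary, which is why an out-of-vocabulary suffix like 'carditis' gets fragmented into pieces such as '##rdi'+'##tis'. A minimal sketch of this lookup follows; the toy vocabularies are illustrative assumptions, not BERT's actual 30K-piece vocabulary.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split, WordPiece-style.

    Non-initial pieces carry the '##' continuation prefix; a word with
    no matching split maps to the single token '[UNK]'.
    """
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, shrinking until a
        # vocabulary piece matches.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

# Toy general-domain vocabulary: no '##carditis' piece, so the
# clinically meaningful suffix is fragmented (as described above).
general_vocab = {"my", "##oca", "##rdi", "##tis", "per", "##ica"}
print(wordpiece_tokenize("myocarditis", general_vocab))
# -> ['my', '##oca', '##rdi', '##tis']

# Toy biomedical vocabulary keeping the suffix as one piece.
bio_vocab = {"myo", "peri", "##carditis"}
print(wordpiece_tokenize("pericarditis", bio_vocab))
# -> ['peri', '##carditis']
```

With a biomedical vocabulary that keeps '##carditis' as a single piece, 'myocarditis' and 'pericarditis' would share a subword for their common root, which is the motivation for retraining BERT with a domain-specific subword vocabulary.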