<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Using Pre-trained Transformer Deep Learning Models to Identify Named Entities and Syntactic Relations for Clinical Protocol Analysis</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Miao</forename><surname>Chen</surname></persName>
							<email>miao.chen@covance.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Covance</orgName>
								<address>
									<addrLine>8211 SciCor Drive</addrLine>
									<settlement>Indianapolis</settlement>
									<region>IN</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fang</forename><surname>Du</surname></persName>
							<email>fang.du@covance.com</email>
							<affiliation key="aff1">
								<orgName type="institution">Covance</orgName>
								<address>
									<addrLine>206 Carnegie Center</addrLine>
									<settlement>Princeton</settlement>
									<region>NJ</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ganhui</forename><surname>Lan</surname></persName>
							<email>ganhui.lan@covance.com</email>
							<affiliation key="aff1">
								<orgName type="institution">Covance</orgName>
								<address>
									<addrLine>206 Carnegie Center</addrLine>
									<settlement>Princeton</settlement>
									<region>NJ</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Victor</forename><surname>Lobanov</surname></persName>
							<email>victor.lobanov@covance.com</email>
							<affiliation key="aff1">
								<orgName type="institution">Covance</orgName>
								<address>
									<addrLine>206 Carnegie Center</addrLine>
									<settlement>Princeton</settlement>
									<region>NJ</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="institution">Stanford University</orgName>
								<address>
									<settlement>Palo Alto</settlement>
									<region>California</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Using Pre-trained Transformer Deep Learning Models to Identify Named Entities and Syntactic Relations for Clinical Protocol Analysis</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">2CB813087667938702BDDA8FD4A83196</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T15:38+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Transformer deep learning models, such as BERT, have demonstrated their effectiveness over previous baselines on a broad range of general-domain natural language processing (NLP) tasks such as classification, named entity recognition, and question answering <ref type="bibr" target="#b3">(Devlin et al. 2018)</ref>. They also exhibit enhanced performance in domain-specific NLP tasks, including BioNLP tasks <ref type="bibr" target="#b8">(Lee et al. 2019;</ref><ref type="bibr" target="#b0">Alsentzer et al. 2019)</ref>. In this study, we focus on clinical trial protocols: exploring and extracting key terms (a named entity recognition task) as well as their relations (a relation extraction task) from the protocols using transformer pre-trained deep learning models. We compare several model configurations and report their results. Our NLP model achieves good performance considering the complex and unique nature of the language in real-world protocols, and has been integrated into the organization's protocol analytics practice. This approach and the extracted information will greatly facilitate trial feasibility analysis for developing new drugs.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Clinical trial protocols (often called "study protocols") contain key information specifying trial design and implementation, but are usually in an unstructured or semi-structured format, which presents a huge challenge for running computational analysis on them. Due to the protocols' critical role, drug development businesses, such as contract research organizations, have been devoting significant amounts of resources to analyzing study protocols in order to precisely understand the operational requirements, comprehensively evaluate the systemic challenges, assess the probability of success without bias, and accurately forecast the cost implications for optimal business planning. Currently, this protocol analysis work is still performed in a labor-intensive fashion, involving numerous resource-checking and cross-referencing steps. Developing safer, cheaper, and more effective drugs faster for better public health thus creates an urgent need for more efficient and effective ways to process text-based protocols.</p><p>Here, we present our efforts to facilitate the protocol analysis workflow by automating the extraction of key information from protocols using natural language processing (NLP) techniques. More specifically, we focus on the eligibility criteria section of the protocols, which contains patient selection criteria; we extract key clinically relevant entities (i.e. named entities) and entity relations (i.e. syntactic relations) from this section. Based on the extracted information, the unstructured protocols can be transformed into a structured network of interconnected key entities (e.g. condition, drug, observation) that can be fed into downstream analytic tasks, for example querying real-world evidence databases for patient population estimation, which is critical for clinical trial design in drug development.</p><p>Covance Inc. 
is the world's largest provider of clinical trial design, monitoring, management, and central lab testing services, and has accumulated a large volume of study protocols. The presented work is the first step of a bigger mission toward solving the protocol analysis challenge. To this end, we employ a transfer learning strategy and experiment with the deep learning family of algorithms, using the recently developed Bidirectional Encoder Representations from Transformers (BERT) based models and fine-tuning them on our in-house clinical trial protocol corpus to identify named entities and their relations.</p><p>Study protocols are rigorous scientific documents with highly domain-specific terms and complex relations. These characteristics bring both benefits and challenges to NLP work: we worry less about preprocessing thanks to the rigorous use of language, but must attend more to the unique yet complex clinical terms and relations. A study protocol's eligibility criteria section is usually composed of two parts, inclusion criteria and exclusion criteria, which respectively describe the unambiguous characteristics of patients to be included in and excluded from the clinical trial. The general public can access some simplified protocol texts via websites such as ClinicalTrials.gov, and these already contain many clinical terms. However, real protocols are much longer with even more domain-specific terms, making the NLP task more difficult. We employ pre-trained BERT transformers to tackle this challenging NLP task, and our study provides quantified evidence of how BERT performs in the clinical trial domain. In our practice, the extracted information is stored in a structured format. Figure <ref type="figure" target="#fig_0">1</ref> shows an example: the inclusion criteria are represented as several key-value clauses so that we can query a patient database to find the patients satisfying these criteria. 
Through extraction we are essentially connecting dots to build a larger graph for knowledge engineering purposes: we connect protocol text to patient database records, connect protocols to condition terms in a medical ontology, and so on. Once the dots are properly connected, we can perform many protocol analysis tasks, such as building a search engine for precise retrieval, composing graph networks to capture missing links, evaluating drug effectiveness by comparison with similar drugs, and clustering and recommending similar protocols for study feasibility analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Related Work</head><p>Named entity recognition (NER) and relation extraction (RE) are two classical natural language processing (NLP) tasks, which we carry out to extract entities and syntactic relations, respectively, in our study. For NER, researchers have mainly investigated probabilistic sequence labeling models such as conditional random fields (CRF), maximum entropy Markov models, and hidden Markov models <ref type="bibr" target="#b6">(Lafferty, McCallum, and Pereira 2001;</ref><ref type="bibr" target="#b12">McCallum, Freitag, and Pereira 2000;</ref><ref type="bibr" target="#b1">Bikel et al. 1998)</ref>. For RE, text classification methods, such as support vector machines, logistic regression, and the perceptron, along with feature engineering, have been used to assign relations between entities <ref type="bibr" target="#b0">(Bach and Badaskar 2007;</ref><ref type="bibr">Jurafsky 2000)</ref>.</p><p>In recent years, with advances in deep neural network methods, significant performance improvements have been achieved for the NER and RE tasks. For NER, embeddings are widely used in neural network models to represent words or characters as high-dimensional vectors. Recurrent neural networks (RNN), including LSTM, GRU, and their variants, are applied because their architectures better represent sentence context as well as the dynamic sentence lengths of natural language <ref type="bibr" target="#b5">(Huang, Xu, and Yu 2015;</ref><ref type="bibr" target="#b20">Yang, Salakhutdinov, and Cohen 2016)</ref>. The Bidirectional LSTM (Bi-LSTM) plus CRF network architecture has also been widely used to achieve better NER performance <ref type="bibr" target="#b11">(Ma and Hovy 2016;</ref><ref type="bibr">Lample et al. 2016)</ref>.</p><p>Despite the improvement over previous models, RNN and LSTM models tend to "forget" earlier context in long sequences, which limits model performance. 
Transformers were subsequently proposed to address this issue. Transformer models use an attention mechanism that attends to each word in a sequence, replacing the sequence-based RNN-style network structure with dot products and multiplications between the key/value/query matrices projected from the embedding vectors <ref type="bibr" target="#b19">(Vaswani et al. 2017)</ref>. Transformers have the advantage of attending to every token in a sequence, whether long or short, and can therefore capture associations even between tokens that are distantly separated from each other. BERT (Bidirectional Encoder Representations from Transformers), a recently popular deep learning NLP model, employs multiple layers of attention and significantly improved NLP task performance over previous models <ref type="bibr" target="#b3">(Devlin et al. 2018)</ref>.</p><p>Additionally, transfer learning aims to transfer a pre-trained model from one task to another, usually by training a general language model on a general-domain data set and transferring it to a downstream task by fine-tuning on the task-specific data set. A number of pre-trained language models have been created to facilitate downstream tasks such as NER and RE, including ELMo, ULMFiT, OpenAI GPT, and BERT, which have outperformed previous baselines, with some even achieving state-of-the-art performance <ref type="bibr" target="#b13">(Peters et al. 2018;</ref><ref type="bibr" target="#b4">Howard and Ruder 2018;</ref><ref type="bibr" target="#b14">Radford et al. 2019)</ref>.</p><p>Based on the original BERT architecture, a number of BERT variants have emerged with alterations for different purposes. 
For example, RoBERTa removes next sentence prediction from the original loss function, along with some other hyperparameter changes; Transformer-XL captures context both within and between segments to tackle long-term dependencies across sentences; and T5 advocates an encoder-decoder architecture, denoising objectives, and other changes based on extensive experiments <ref type="bibr" target="#b9">(Liu et al. 2019;</ref><ref type="bibr" target="#b2">Dai et al. 2019;</ref><ref type="bibr" target="#b15">Raffel et al. 2019)</ref>.</p><p>NER and RE have also been longstanding tasks in the biomedical NLP domain. Researchers have investigated applying similar yet more customized approaches to biomedical texts, such as CRF models and BiLSTM+CRF neural networks <ref type="bibr" target="#b7">(Leaman and Gonzalez 2008;</ref><ref type="bibr" target="#b10">Lyu et al. 2017;</ref><ref type="bibr" target="#b20">Wei et al. 2016)</ref>. With the introduction of BERT, BERT-based models have been adapted to the biomedical domain by retraining on biomedical corpora; examples include BioBERT, SciBERT, and Clinical BERT <ref type="bibr" target="#b8">(Lee et al. 2019;</ref><ref type="bibr">Beltagy, Cohan, and Lo 2019;</ref><ref type="bibr" target="#b0">Alsentzer et al. 2019)</ref>.</p><p>In the clinical informatics field, it is important to convert unstructured criteria text to a structured format, because this makes it possible to automatically parse a criterion and query a real-world evidence database for suitable patients. NER and RE algorithms are therefore an appropriate and natural fit for this practice: NER extracts concepts such as conditions and observations that are related to a patient; RE provides operational information such as the range of a particular lab test result for patient selection. Criteria2Query is a pioneering work in the space of translating study criteria to SQL queries <ref type="bibr" target="#b21">(Yuan et al. 2019)</ref>. 
It relies mainly on CRF sequence labeling for the NER task and SVM classification for relation extraction. To the best of our knowledge, there has been no research or practice using pre-trained transformer deep learning methods to extract structured information from unstructured clinical trial protocols. Motivated by the excellent performance of BERT-based models on NER and RE tasks in general domains, we develop models and evaluate their performance in the clinical trial domain.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Methodology Data Set</head><p>To facilitate our NLP approach, we selected 470 study protocols from Covance's in-house protocol database. Our protocol corpus comprises the eligibility criteria sections of these selected study protocols. An eligibility criteria section typically contains 5-20 sentences that define the criteria used to select and recruit patients for the clinical study. Our data contain a total of 30,183 criteria sentences.</p><p>Data Annotation. We have the eligibility criteria annotated using the IOB format <ref type="bibr" target="#b16">(Ramshaw and Marcus 1999)</ref>. The corpus is annotated by well-trained biomedical domain experts as the gold standard for training and testing. They manually annotate the key clinical entities and, where present, their pairwise relations. We focus on 15 types of entities and 7 types of relations that help clinically define a patient cohort:</p><p>Entities: Condition, Observation, Procedure, Device, Drug, Investigational product, Event, Refractory condition, Demographics, Measurement, Temporal constraints, Qualifier/modifier, Anatomic location, Negation cue, Permission cue</p><p>Syntactic relations: Has value, Has temporal constraint, Modified by, Located in, Is negated, Is permitted, Specified by</p><p>Data Split. For the NER task, we randomly split the 30,183 sentences into training (60%, 18,109 sentences) and test (40%, 12,074 sentences) sets. For the RE task, before splitting the data for training and testing, we first check whether a sentence contains multiple relations; if so, we duplicate the sentence for each pair of related entities and use their relation type as the label for classification. This results in 52,470 relation sample sentences, on which we perform a random split stratified by relation class to derive training (60%, 31,482 relation samples) and test (40%, 20,988 relation samples) sets. 
Tables <ref type="table" target="#tab_1">1  and 2</ref> show data statistics for the NER and RE tasks.</p></div>
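The sentence-duplication step described above for preparing RE samples can be sketched as follows. This is an illustrative snippet of our own, not the production pipeline; the function name, sample sentence, and span format (inclusive token indices) are hypothetical, while the relation labels come from the list above.

```python
# Expand a sentence with multiple annotated relations into one training
# sample per related entity pair, keeping the relation type as the label.
def expand_relation_samples(sentence_tokens, relations):
    """relations: list of (head_span, tail_span, relation_label) tuples,
    where each span is an inclusive (start_token, end_token) pair."""
    samples = []
    for head, tail, label in relations:
        # Each duplicated instance carries the same tokens but targets
        # exactly one entity pair and its relation label.
        samples.append({
            "tokens": sentence_tokens,
            "head": head,
            "tail": tail,
            "label": label,
        })
    return samples

# A sentence with two relations yields two samples.
tokens = ["Patients", "with", "severe", "asthma", "within", "6", "months"]
rels = [((3, 3), (2, 2), "Modified by"),
        ((3, 3), (4, 6), "Has temporal constraint")]
samples = expand_relation_samples(tokens, rels)
```

At prediction time the same expansion is driven by NER output instead of human annotations, enumerating candidate entity pairs.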
<div xmlns="http://www.tei-c.org/ns/1.0"><head>NER Task</head><p>As previously mentioned, we use NER algorithms to extract clinically relevant entities in the eligibility criteria section, and we particularly choose BERT, a pre-trained transformer type of deep learning model, because of its reported superior performance on many NLP tasks. Thanks to the attention transformer in BERT, it is able to provide dynamic context embeddings for tokens, which helps address the polysemy issue. BERT is a language model pre-trained on a large general-domain corpus that can be applied to downstream tasks by adding simply structured task layers and fine-tuning on a task-specific data set. We follow this fine-tuning practice based on pre-trained models to derive our NER model <ref type="bibr" target="#b3">(Devlin et al. 2018;</ref><ref type="bibr" target="#b8">Lee et al. 2019</ref>). We explore several options with regard to the choice of pre-trained models and task layers.</p><p>NER task layers. The original BERT paper indicates that, when used for NER tasks, the pre-trained BERT model can simply be followed by a softmax layer in which each token is classified into its most likely entity class, without adding any CRF layer <ref type="bibr" target="#b3">(Devlin et al. 2018</ref>). However, our experiments suggest that this approach sometimes fails to recognize contiguous phrases as whole entities. To address this issue, we further experiment with BiLSTM+CRF layers as the NER task layer, for their potentially better ability to capture bidirectional context as well as tagging likelihood at the sentence level (as opposed to the token level).</p><p>Cased or uncased. The BERT model provided by Google includes versions with and without lowercasing preprocessing on the tokens. We experiment with both the cased (not applying lowercasing) and uncased (applying lowercasing) options. 
Consequently, the two options use different subword vocabularies: the cased model has 28,996 subwords and the uncased model 30,522.</p><p>Pre-trained models. We use BERT-base, a smaller version of BERT, as well as BioBERT as the pre-trained models.</p><p>Hyperparameters. For both the BERT-base and BioBERT models, we set number of epochs = 20, learning rate = 2e-5, training batch size = 32, and max sequence length = 32. When using BiLSTM+CRF as the task layer, we set the BiLSTM layer size to 128.</p><p>The above model options result in 6 NER models: BERT-base-uncased + Softmax; BERT-base-cased + Softmax; BioBERT + Softmax; BERT-base-uncased + BiLSTM+CRF; BERT-base-cased + BiLSTM+CRF; BioBERT + BiLSTM+CRF.</p></div>
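As a rough illustration of the softmax task layer described above (a toy NumPy sketch, not our actual TensorFlow implementation; the sizes and the three-tag scheme are assumed for the example), each token's final-layer BERT vector is projected to per-tag logits and normalized independently:

```python
import numpy as np

# Toy stand-in for BERT's final-layer token vectors; in the real model H
# comes from a fine-tuned BERT/BioBERT encoder.
rng = np.random.default_rng(0)
n_tokens, hidden, n_tags = 5, 8, 3        # tags: O, B-Condition, I-Condition
H = rng.normal(size=(n_tokens, hidden))   # contextual token embeddings
W = rng.normal(size=(hidden, n_tags))     # task-layer projection
b = np.zeros(n_tags)

logits = H @ W + b                        # one tag distribution per token
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True) # row-wise softmax
pred_tags = probs.argmax(axis=1)          # greedy per-token decoding

# A BiLSTM+CRF task layer would instead score whole tag sequences,
# discouraging invalid transitions such as O -> I-Condition; the
# per-token decoding here is what allows broken-phrase errors.
```

The last comment points at the boundary-error behavior discussed above: nothing in per-token decoding ties a token's tag to its neighbors'.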
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RE Task</head><p>The RE task is also treated as a downstream task for the pre-trained models. The original BERT paper did not include the RE task among its downstream tasks, whereas the BioBERT study investigated it owing to its importance in the biomedical NLP domain <ref type="bibr" target="#b8">(Lee et al. 2019)</ref>. BioBERT handles relation extraction as a classification task at the sentence or sequence level. In particular, it assumes that each sentence contains at most one relation and classifies whether a whole sentence, instead of a particular pair of entities, contains a relation of interest, e.g. a gene-disease relation. This approach is not directly applicable to our data for two reasons: 1) our data contain multiple types of relations, and 2) in our data set, one sentence often contains multiple relations (52,470 relations / 30,183 sentences = 1.7 relations per sentence on average).</p><p>We employ the following strategy for the RE task. In training, we first scan through each sentence for entities using the human annotations and record the token positions of each entity; if a sentence contains n (n &gt; 1) pairs of entities with human-annotated relations, we duplicate this sentence n times so that each instance targets one pair of entities and their relation. In prediction, we use the NER pipeline results to locate entities, enumerate all legitimate entity pairs, and duplicate sentences accordingly. Since we record the token positions of each entity pair, we can get the BERT output vectors for them based on their position information, concatenate the two vectors, and then feed the result to a softmax layer to classify their relation. The result can be one of the 7 relations listed in Table <ref type="table" target="#tab_1">2</ref> or 'no relation'.</p><p>More specifically, the input fed to the BERT RE model is the sentence text along with the positions of the entity pair. 
We do not make use of entity type information for the following reasons: 1) this end-to-end (i.e. tokens-to-relation) practice makes the RE model more useful as a standalone tool that does not require entity types; 2) in prediction mode, errors in entity prediction could propagate to the RE task, which we mitigate by including only the entity position information. Figure <ref type="figure" target="#fig_3">3</ref> shows the neural architecture of our RE task.</p><p>For training purposes, we randomly generate negative samples for the 'no relation' class, since two entities may have no relation with each other. We obtain negative samples in two ways: one is to randomly choose two unrelated entities in a sentence; the other is to break an existing related entity pair and establish a non-related pair between one of the entities in the original pair and another unrelated entity in the sentence.</p><p>Similar to the NER task, we experiment with 3 pre-trained models, with softmax as the task layer for all of them:</p><p>• BERT-base-uncased: BERT-base pre-trained model, uncased</p><p>• BERT-base-cased: BERT-base pre-trained model, cased</p><p>• BioBERT: BioBERT pre-trained model (cased)</p><p>The following hyperparameter configuration is used: number of epochs = 20, learning rate = 2e-5, training batch size = 32, max sequence length = 32.</p></div>
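The position-based RE head described above can be sketched roughly as follows. This is a toy NumPy illustration under assumed sizes (hidden width, sentence length, and entity positions are made up), not the actual TensorFlow model: the BERT output vectors at the two entity positions are concatenated and classified into 7 relation types plus 'no relation'.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, hidden, n_classes = 12, 8, 8    # 7 relations + 'no relation'
H = rng.normal(size=(n_tokens, hidden))   # stand-in for BERT output vectors
head_pos, tail_pos = 2, 7                 # token positions of the entity pair

# Only positions are used, never entity types, matching the design above.
pair_vec = np.concatenate([H[head_pos], H[tail_pos]])   # shape (2*hidden,)

W = rng.normal(size=(2 * hidden, n_classes))            # softmax task layer
logits = pair_vec @ W
probs = np.exp(logits - logits.max())
probs /= probs.sum()
predicted_class = int(probs.argmax())     # index into 7 relations + none
```

Because classification depends only on the two position-indexed vectors, the same sentence can be scored repeatedly, once per candidate entity pair.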
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Results and Analysis</head><p>We implement the NER and RE tasks in TensorFlow based on the BERT neural architecture and run experiments on an AWS p2.xlarge GPU instance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>NER Results</head><p>We follow the practice of the SemEval-2013 Drug-Drug Interactions task and evaluate NER performance by 3 matching standards: strict, exact, and partial (Segura-Bedmar, Martínez, and Herrero-Zazo 2013). Strict matching evaluates both the boundary and the entity type of entity phrases; exact matching evaluates the exact boundary regardless of entity type; and partial matching measures the partial boundary of entities regardless of entity type (thus the most lenient). We calculate precision (P), recall (R), and f1-score (F) for the three evaluation types, and additionally we report macro-averaged P/R/F results. The results are shown in Table <ref type="table" target="#tab_2">3</ref>.</p><p>In our experiments, fine-tuning the pre-trained BioBERT model achieves slightly better performance than its BERT counterparts. For example, BioBERT + Softmax has an f1-score of 70.61, better than BERT-base-uncased + Softmax's 69.80 and BERT-base-cased + Softmax's 69.68. Similarly, BioBERT + BiLSTM+CRF holds a higher f1-score than BERT-base-uncased + BiLSTM+CRF and BERT-base-cased + BiLSTM+CRF for all four evaluation types.</p><p>When comparing the cased and uncased strategies, we notice that the uncased pre-trained models outperform the cased ones with the same neural architecture: e.g. BERT-base-uncased + BiLSTM+CRF achieves an f1-score of 70.28 for the strict evaluation type, higher than the 69.89 of BERT-base-cased + BiLSTM+CRF. This finding suggests that lowercasing during preprocessing actually enhances performance slightly, which is counter-intuitive for NER tasks as entities are often case-sensitive. Meanwhile, we also find that the two BioBERT models, which are cased, perform better than their peer models with the same neural architecture. 
However, since BioBERT only offers a cased option, we cannot discern the relative contribution of casing in the BioBERT pre-trained model.</p><p>From Table <ref type="table" target="#tab_2">3</ref>, it is not surprising that for a given model, the partial evaluation usually holds the highest score, followed by exact, strict, and macro. Another observation is that when we loosen the evaluation type from strict to exact, i.e. focusing on entity boundaries without penalizing entity type errors, performance improves but still remains in the 73.15-74.06 range, suggesting that the experimented BERT-based models fail to identify entity boundaries very precisely, which can be of interest for future investigation.</p><p>In our experiments with simple Softmax as the task layer, we observe more boundary detection errors. This was in fact our motivation for adding the BiLSTM+CRF layers as the NER task layer. However, the results show that, given the same pre-trained model configuration, whether BiLSTM+CRF consistently improves performance is debatable. For example, BioBERT + BiLSTM+CRF slightly outperforms BioBERT + Softmax in strict matching precision and f1-score, but BioBERT + Softmax beats BioBERT + BiLSTM+CRF in strict matching recall.</p><p>We also find that the recall score is consistently higher than the precision score for all models at all evaluation standards, indicating that the models tend to make more false positive predictions than false negative predictions. The macro scores are lower than strict/exact/partial because macro averaging simply averages the performance of different entity types, and some small-sample entity types have lower performance due to a lack of training data.</p><p>Overall, BioBERT + BiLSTM+CRF produces the best precision and f1-scores for all four evaluation types, whereas BioBERT + Softmax holds the highest recalls. 
These results suggest that fine-tuning BioBERT lends itself better to NER tasks in the clinical trial domain, which seems intuitive. As for the task layer, the choice between Softmax and BiLSTM+CRF does not significantly affect performance.</p></div>
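The strict and exact matching standards used above can be illustrated with a small sketch of our own (a simplified illustration, not the SemEval scorer; the partial standard, which also credits overlapping boundaries, is omitted for brevity):

```python
def iob_spans(tags):
    """Decode IOB tags into (start, end, type) spans, end exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):       # sentinel flushes last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((start, i, etype))
                start = None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
        # "I-" tags simply continue the current span
    return spans

def prf(gold, pred, strict=True):
    """Strict compares (boundary, type); exact compares boundary only."""
    key = (lambda s: s) if strict else (lambda s: s[:2])
    g, p = {key(s) for s in gold}, {key(s) for s in pred}
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# Correct boundaries but one wrong entity type: exact is perfect,
# strict is penalized.
gold = iob_spans(["B-Drug", "I-Drug", "O", "B-Condition"])
pred = iob_spans(["B-Drug", "I-Drug", "O", "B-Observation"])
```

Here `prf(gold, pred, strict=True)` yields 0.5 across P/R/F while the exact variant yields 1.0, mirroring the strict-versus-exact gap reported in Table 3.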
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RE Results</head><p>RE evaluation results are shown in Table <ref type="table" target="#tab_3">4</ref>, in which we report micro, macro, and weighted precision (P), recall (R), and f1-score (F). From these results, we find that BERT-base-uncased has the highest f1-scores, whereas BERT-base-cased has the lowest. Comparing BERT-base-cased and BioBERT indicates that BioBERT helps performance slightly, at least in this cased scenario. On the other hand, BERT-base-uncased noticeably improves over its cased peer, BERT-base-cased, by a 4.33 percentage point margin. Therefore, just like the NER task, the RE task also benefits from lowercasing, probably because the uncased setting reduces vocabulary variation in processing. We also observe that recall and precision are close to each other, with precision slightly higher for the macro evaluation but, on the contrary, recall slightly higher for micro and weighted. These observations suggest that the model has a higher precision score than recall score in classes with fewer samples, such as 'is located' and 'is negated' (in Table <ref type="table" target="#tab_0">1</ref>), and when doing macro evaluation, the contribution of the smaller classes becomes more visible.</p><p>Overall, the BERT-base-uncased model prevails: it outperforms the other two models on every evaluation type and measure. For example, it has an f1-score of 78.79 for micro, compared to BERT-base-cased's 74.46 and BioBERT's 74.60. These results indicate again that lowercasing preprocessing helps NLP tasks even in the clinical trial domain, where many terms are written in capital letters. Secondly, BioBERT beating BERT-base-cased by a small margin may suggest that although pre-training in the biomedical domain can bring some benefit, it is still not specific enough for clinical trials. 
Since no uncased BioBERT pre-trained model is available, it is unclear whether training on a biomedical corpus with lowercasing preprocessing could synergistically improve performance. Considering the big improvement from BERT-base-cased to BERT-base-uncased, we believe an uncased variant of the current BioBERT model is worth future investigation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Error Analysis</head><p>We present and inspect NER prediction results from one of the models (BERT-base-uncased + Softmax) in Brat, an open-source tool that helps visualize annotation results using color bars <ref type="bibr" target="#b18">(Stenetorp et al. 2012)</ref>. We overlay human and prediction annotations together in Brat to facilitate the comparison.</p><p>The NER errors can be broadly categorized into boundary errors and entity type errors, as reflected by the four evaluation types. For boundary errors, one pattern is that BERT tends to mis-annotate some words inside a multi-word phrase. For example, as shown in Figure <ref type="figure" target="#fig_4">4</ref>, "at least a 3 month" is one temporal constraint entity, but the NER model only captures "at" + "3 month" while missing the words in the middle ("least a"). This reflects a potential problem with BERT NER models: although they can assign entity classes relatively well, the lack of structure enforcement on the output layer can cause inconsistent labels within a full phrase. In some cases, the NER model captures longer entities than the human annotator. For example, the model annotates "[cardiac mechanical assist device]|Device", whereas the gold standard annotates the same phrase as "[cardiac]|AnatomicLocation" + "[mechanical assist device]|Device". In other cases, the situation reverses, and the NER model chunks one gold-standard entity into multiple ones. For example, "[non-steroidal anti-inflammatory drugs]|Drug" is chunked into a Qualifier/Modifier and a drug: "[non-steroidal]|Qualifier/Modifier [anti-inflammatory drugs]|Drug". 
The boundary merging and chunking issues, as illustrated by these two examples, occur frequently with the Qualifier/Modifier class, as it is arguable whether a complex term should be annotated as one whole entity or as a Qualifier/Modifier plus an entity.</p><p>For entity type errors, we observe a few cases, such as "urinalysis" (Procedure type) being predicted as an Observation entity, and "gastrointestinal motility" (Condition type) being predicted as a Drug. Type errors occur less frequently than boundary errors according to our manual inspection.</p><p>For the RE task, we manually screen the predictions from the BERT-base-uncased + Softmax model against the gold standard. We first observe that NER boundary errors can propagate to the RE task. Note that we only use named entity positions, not types, in the RE task, and therefore only NER boundary errors can affect RE performance. For example, "Transient neurologic deficits", annotated as one Condition entity in the gold standard, is split into "Transient" (Qualifier/Modifier) and "neurologic deficits" (Condition), causing the RE task to predict a 'modified by' relation between the two entities that does not actually exist in the gold standard. Another major category of RE classification error is that a number of actual relations are misclassified as 'no relation', while misclassification between the other classes is much less frequent.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusion and Future Work</head><p>In this study, we focus on extracting clinically relevant terms and relations from protocol eligibility criteria by applying pre-trained transformer deep learning NLP models to the NER and RE tasks. We experiment with several configurations of the pre-trained BERT models and report our results and findings.</p><p>Our results demonstrate the effectiveness of NLP models in processing clinical trial protocols. Although the processed texts are highly domain-specific, with specialized clinical and medical terms and logical relations, the BERT and BioBERT models achieved acceptable performance. We also find that, in general, BioBERT, which is pre-trained on a biomedical corpus, outperforms BERT, which is pre-trained on a general-domain corpus. This agrees with the general understanding that domain-specific pre-training is important for achieving higher performance on domain-specific tasks.</p><p>A surprising finding is that even though the clinical trial domain contains many capitalized terminologies, lowercasing during preprocessing improves the performance of both the NER and RE tasks. Our hypothesis is that reducing token variation (lowercasing yields fewer distinct token forms) matters more for these tasks than preserving case information.</p><p>It is also worth noting that there is room to improve the quality of our gold standard. Because the protocols cover many different sub-domains in the biomedical and clinical sciences, spanning multiple therapeutic areas, even human experts can easily make mistakes or annotate inconsistently. In fact, we found many cases where the model predictions are in fact correct, despite differing from the gold standard. To address this annotation quality issue, we employed an iterative annotation pipeline in which human experts verify documents pre-annotated by the NLP models. 
We anticipate that this practice can partly address the issue.</p><p>We believe that model performance can be further improved, and we plan to explore several directions. The first is to train a biomedical BERT model from scratch using a domain-specific vocabulary. The BERT model handles tokens by splitting them into subwords using a predefined subword vocabulary. For example, 'myocarditis' and 'pericarditis', two heart conditions sharing the same suffix 'carditis', are represented as 'my'+'##oca'+'##rdi'+'##tis' and 'per'+'##ica'+'##rdi'+'##tis' respectively. This tokenization does not represent the suffix in a biomedically meaningful way, because the vocabulary lacks biomedical subwords. We hypothesize that subwords derived from the biomedical domain, reflecting word-root patterns, can enhance the word representations of BERT models and thus improve downstream task performance. To this end, we can train a BERT model from scratch using a biomedical corpus and a biomedical subword vocabulary.</p><p>The second strategy is multi-task co-training: since the NER and RE tasks depend on each other (knowing one task's output can facilitate the other), joint learning is expected to improve performance on both.</p><p>Our third strategy is to reduce the unnecessary relations currently predicted by the RE model. Our current greedy prediction pipeline enumerates all possible entity pairs, which results in an unnecessarily large set of candidate pairs. One way to address this issue is to use dependency parsing information, which can indicate whether two terms have a dependency relation, to prune unnecessary entity pairs. The information extracted by the NER and RE tasks has great potential to assist the drug development business, especially study feasibility analysis. 
The derived information forms the basis of a local knowledge graph for the protocols, and of a global graph when merged with external structured information such as drug ontologies. In conclusion, this is our first step in a broader mission to apply deep learning to business cases in drug development, and subsequent analysis based on the derived graphs can further extend our contributions and insights in this research area.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Structured information extracted from protocol eligibility criteria.</figDesc><graphic coords="2,73.63,54.00,199.24,107.78" type="bitmap" /></figure>
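The subword issue motivating the first future-work direction can be illustrated with a minimal sketch of BERT-style WordPiece tokenization (greedy longest-match-first). The two vocabularies below are toy examples of ours, not BERT's actual vocabulary, but they reproduce the 'myocarditis'/'pericarditis' splits discussed above and show how a biomedical vocabulary containing the root '##carditis' would keep it intact:

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece tokenization (BERT-style).

    Repeatedly take the longest prefix of the remaining characters that
    appears in the vocabulary; non-initial pieces carry a '##' prefix.
    If some remainder matches nothing, the whole word becomes [UNK].
    """
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]  # no subword covers this position
        pieces.append(piece)
        start = end
    return pieces

# Toy general-domain vocabulary: the shared root 'carditis' gets shredded.
general = {"my", "##oca", "##rdi", "##tis", "per", "##ica"}
# Toy biomedical vocabulary that keeps the root as one piece.
biomed = {"myo", "peri", "##carditis"}

print(wordpiece("myocarditis", general))   # ['my', '##oca', '##rdi', '##tis']
print(wordpiece("pericarditis", general))  # ['per', '##ica', '##rdi', '##tis']
print(wordpiece("pericarditis", biomed))   # ['peri', '##carditis']
```

With the biomedical vocabulary, both heart conditions share the '##carditis' piece, so its learned embedding is shared across them, which is the intuition behind training with a domain-specific subword vocabulary.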
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Neural architecture of the BERT NER task (with Softmax as the task layer).</figDesc><graphic coords="4,54.00,54.00,206.45,230.27" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>• BERT-base (uncased), Softmax: BERT-base uncased pre-trained model, softmax as NER task layer • BERT-base (cased), Softmax: BERT-base cased pre-trained model, softmax as NER task layer • BioBERT, Softmax: BioBERT pre-trained model (cased), softmax as NER task layer • BERT-base (uncased), BiLSTM+CRF: BERT-base pre-trained uncased model, BiLSTM+CRF as NER task layer • BERT-base (cased), BiLSTM+CRF: BERT-base pre-trained cased model, BiLSTM+CRF as NER task layer • BioBERT, BiLSTM+CRF: BioBERT pre-trained model (cased), BiLSTM+CRF as NER task layer The layout of the BERT NER neural architecture is shown in Figure 2.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Neural architecture of the BERT RE task (with Softmax as the task layer).</figDesc><graphic coords="4,319.50,54.00,233.39,224.96" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: An example of the NER engine mis-annotating tokens within a phrase.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Train and test data counts for the NER task.</figDesc><table><row><cell>Entity</cell><cell>Train</cell><cell>Test</cell></row><row><cell>Condition</cell><cell cols="2">12,682 8,537</cell></row><row><cell>Observation</cell><cell>7,309</cell><cell>5,218</cell></row><row><cell>Procedure</cell><cell>3,406</cell><cell>2,234</cell></row><row><cell>Device</cell><cell>221</cell><cell>140</cell></row><row><cell>Drug</cell><cell>7,793</cell><cell>5,858</cell></row><row><cell cols="2">Investigational product 329</cell><cell>224</cell></row><row><cell>Event</cell><cell>2,430</cell><cell>1,625</cell></row><row><cell>Refractory condition</cell><cell>381</cell><cell>278</cell></row><row><cell>Demographics</cell><cell>498</cell><cell>381</cell></row><row><cell>Measurement</cell><cell>4,540</cell><cell>3,344</cell></row><row><cell>Temporal constraints</cell><cell>6,968</cell><cell>4,589</cell></row><row><cell>Qualifier/modifier</cell><cell>7,853</cell><cell>5,196</cell></row><row><cell>Anatomic location</cell><cell>427</cell><cell>223</cell></row><row><cell>Negation cue</cell><cell>921</cell><cell>615</cell></row><row><cell>Permission cue</cell><cell>1,236</cell><cell>869</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Train and test data counts for the RE task.</figDesc><table><row><cell>Relation</cell><cell>Train</cell><cell>Test</cell></row><row><cell>is negated</cell><cell>703</cell><cell>468</cell></row><row><cell>is permitted</cell><cell>1,009</cell><cell>673</cell></row><row><cell>modified by</cell><cell>5,715</cell><cell>3,810</cell></row><row><cell>has value</cell><cell>3,326</cell><cell>2,218</cell></row><row><cell cols="2">has temporal constraint 6,169</cell><cell>4,112</cell></row><row><cell>is located</cell><cell>215</cell><cell>143</cell></row><row><cell>specified by</cell><cell>3,729</cell><cell>2,486</cell></row><row><cell>no relation</cell><cell cols="2">10,616 7,078</cell></row><row><cell>total count</cell><cell cols="2">31,482 20,988</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>NER task results: Precision (P), Recall (R), F1 score (F).</figDesc><table><row><cell>NER Model</cell><cell>Type</cell><cell>P</cell><cell>R</cell><cell>F</cell></row><row><cell></cell><cell>strict</cell><cell cols="3">67.76 71.98 69.80</cell></row><row><cell>BERT-base (uncased),</cell><cell>exact</cell><cell cols="3">71.02 75.44 73.16</cell></row><row><cell>Softmax</cell><cell cols="4">partial 75.28 79.96 77.55</cell></row><row><cell></cell><cell cols="4">macro 62.65 66.83 64.63</cell></row><row><cell></cell><cell>strict</cell><cell cols="3">67.82 71.66 69.68</cell></row><row><cell>BERT-base (cased),</cell><cell>exact</cell><cell cols="3">71.19 75.22 73.15</cell></row><row><cell>Softmax</cell><cell cols="4">partial 75.41 79.68 77.49</cell></row><row><cell></cell><cell cols="4">macro 63.04 66.37 64.63</cell></row><row><cell></cell><cell>strict</cell><cell cols="3">68.73 72.60 70.61</cell></row><row><cell>BioBERT,</cell><cell>exact</cell><cell cols="3">71.87 75.91 73.83</cell></row><row><cell>Softmax</cell><cell cols="4">partial 75.99 80.26 78.06</cell></row><row><cell></cell><cell cols="4">macro 62.97 67.27 65.03</cell></row><row><cell></cell><cell>strict</cell><cell cols="3">68.59 72.06 70.28</cell></row><row><cell>BERT-base (uncased),</cell><cell>exact</cell><cell cols="3">71.85 75.49 73.62</cell></row><row><cell>BiLSTM+CRF</cell><cell cols="4">partial 76.10 79.95 77.98</cell></row><row><cell></cell><cell cols="4">macro 63.43 66.45 64.88</cell></row><row><cell></cell><cell>strict</cell><cell cols="3">68.09 71.80 69.89</cell></row><row><cell>BERT-base (cased),</cell><cell>exact</cell><cell cols="3">71.34 75.22 73.23</cell></row><row><cell>BiLSTM+CRF</cell><cell cols="4">partial 75.55 79.67 77.56</cell></row><row><cell></cell><cell cols="4">macro 62.68 66.41 64.45</cell></row><row><cell></cell><cell>strict</cell><cell cols="3">69.12 72.47 70.76</cell></row><row><cell>BioBERT,</cell><cell>exact</cell><cell cols="3">72.35 75.85 74.06</cell></row><row><cell>BiLSTM+CRF</cell><cell cols="4">partial 76.55 80.25 78.36</cell></row><row><cell></cell><cell cols="4">macro 63.79 67.44 65.54</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4 :</head><label>4</label><figDesc>RE task results: Precision (P), Recall (R), F1 score (F).</figDesc><table><row><cell>RE Model</cell><cell>Type</cell><cell>P</cell><cell>R</cell><cell>F</cell></row><row><cell></cell><cell>micro</cell><cell cols="3">78.10 79.49 78.79</cell></row><row><cell>BERT-base (uncased)</cell><cell>macro</cell><cell cols="3">76.43 76.22 76.24</cell></row><row><cell></cell><cell cols="4">weighted 78.03 79.49 78.72</cell></row><row><cell></cell><cell>micro</cell><cell cols="3">73.61 75.33 74.46</cell></row><row><cell>BERT-base (cased)</cell><cell>macro</cell><cell cols="3">69.56 68.63 68.80</cell></row><row><cell></cell><cell cols="4">weighted 73.41 75.33 74.27</cell></row><row><cell></cell><cell>micro</cell><cell cols="3">74.37 74.83 74.60</cell></row><row><cell>BioBERT</cell><cell>macro</cell><cell cols="3">70.30 68.34 69.08</cell></row><row><cell></cell><cell cols="4">weighted 74.17 74.83 74.44</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Alsentzer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Murphy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Boag</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-H</forename><surname>Weng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Naumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>McDermott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Badaskar</surname></persName>
		</author>
		<author>
			<persName><surname>Beltagy</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1904.03323</idno>
		<idno>arXiv:1903.10676</idno>
		<title level="m">SciBERT: Pretrained contextualized embeddings for scientific text</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Lo</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2007">2019. 2007. 2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
	<note>Literature review for Language and Statistics II</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Nymble: a high-performance learning name-finder</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Bikel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Schwartz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Weischedel</surname></persName>
		</author>
		<idno>arXiv preprint cmp-lg/9803003</idno>
		<imprint>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">W</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carbonell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1901.02860</idno>
		<title level="m">Transformer-XL: Attentive language models beyond a fixed-length context</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Universal language model fine-tuning for text classification</title>
		<author>
			<persName><forename type="first">J</forename><surname>Howard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ruder</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1801.06146</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Bidirectional LSTM-CRF models for sequence tagging</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1508.01991</idno>
	</analytic>
	<monogr>
		<title level="m">Speech &amp; language processing</title>
				<imprint>
			<publisher>Pearson Education India</publisher>
			<date type="published" when="2000">2015. 2000</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Conditional random fields: Probabilistic models for segmenting and labeling sequence data</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lafferty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>McCallum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">C</forename><surname>Pereira</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1603.01360</idno>
	</analytic>
	<monogr>
		<title level="m">Neural architectures for named entity recognition</title>
				<imprint>
			<date type="published" when="2001">2001. 2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
	<note>ICML proceedings</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">BANNER: an executable survey of advances in biomedical named entity recognition</title>
		<author>
			<persName><forename type="first">R</forename><surname>Leaman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gonzalez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Biocomputing</title>
				<imprint>
			<date type="published" when="2008">2008. 2008</date>
			<biblScope unit="page" from="652" to="663" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">BioBERT: a pre-trained biomedical language representation model for biomedical text mining</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yoon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">H</forename><surname>So</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kang</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">Bioinformatics</note>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">RoBERTa: A robustly optimized BERT pretraining approach</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1907.11692</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Long short-term memory RNN for biomedical named entity recognition</title>
		<author>
			<persName><forename type="first">C</forename><surname>Lyu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ji</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC bioinformatics</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">462</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hovy</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1603.01354</idno>
		<title level="m">End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Maximum entropy markov models for information extraction and segmentation</title>
		<author>
			<persName><forename type="first">A</forename><surname>McCallum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Freitag</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">C</forename><surname>Pereira</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICML</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page" from="591" to="598" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Peters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Neumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Iyyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gardner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1802.05365</idno>
		<title level="m">Deep contextualized word representations</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Language models are unsupervised multitask learners</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Luan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">OpenAI Blog</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">8</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Matena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Liu</surname></persName>
		</author>
		<idno>arXiv e-prints</idno>
		<title level="m">Exploring the limits of transfer learning with a unified text-to-text transformer</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Text chunking using transformation-based learning</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">A</forename><surname>Ramshaw</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">P</forename><surname>Marcus</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Natural language processing using very large corpora</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="157" to="176" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">SemEval-2013 Task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013)</title>
		<author>
			<persName><forename type="first">I</forename><surname>Segura-Bedmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Martínez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Herrero-Zazo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">brat: a web-based tool for NLP-assisted text annotation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Stenetorp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pyysalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Topić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ohta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ananiadou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tsujii</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics</title>
				<meeting>the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics<address><addrLine>Avignon, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="102" to="107" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ł</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="5998" to="6008" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Disease named entity recognition by combining conditional random fields and bidirectional recurrent neural networks</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Cohen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1603.06270</idno>
	</analytic>
	<monogr>
		<title level="m">Multitask cross-lingual sequence tagging from scratch</title>
				<imprint>
			<date type="published" when="2016">2016. 2016. 2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Criteria2Query: a natural language interface to clinical databases for cohort definition</title>
		<author>
			<persName><forename type="first">C</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">B</forename><surname>Ryan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hardin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Makadia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Medical Informatics Association</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="294" to="305" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
