<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Using Pre-trained Transformer Deep Learning Models to Identify Named Entities and Syntactic Relations for Clinical Protocol Analysis</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Miao</forename><surname>Chen</surname></persName>
							<email>miao.chen@covance.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Covance</orgName>
								<address>
									<addrLine>8211 SciCor Drive</addrLine>
									<settlement>Indianapolis</settlement>
									<region>IN</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fang</forename><surname>Du</surname></persName>
							<email>fang.du@covance.com</email>
							<affiliation key="aff1">
								<orgName type="institution">Covance</orgName>
								<address>
									<addrLine>206 Carnegie Center</addrLine>
									<settlement>Princeton</settlement>
									<region>NJ</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ganhui</forename><surname>Lan</surname></persName>
							<email>ganhui.lan@covance.com</email>
							<affiliation key="aff1">
								<orgName type="institution">Covance</orgName>
								<address>
									<addrLine>206 Carnegie Center</addrLine>
									<settlement>Princeton</settlement>
									<region>NJ</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Victor</forename><surname>Lobanov</surname></persName>
							<email>victor.lobanov@covance.com</email>
							<affiliation key="aff1">
								<orgName type="institution">Covance</orgName>
								<address>
									<addrLine>206 Carnegie Center</addrLine>
									<settlement>Princeton</settlement>
									<region>NJ</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="institution">Stanford University</orgName>
								<address>
									<settlement>Palo Alto</settlement>
									<region>California</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Using Pre-trained Transformer Deep Learning Models to Identify Named Entities and Syntactic Relations for Clinical Protocol Analysis</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">2CB813087667938702BDDA8FD4A83196</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T15:38+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Transformer deep learning models, such as BERT, have demonstrated their effectiveness over previous baselines on a broad range of general-domain natural language processing (NLP) tasks such as classification, named entity recognition, and question answering <ref type="bibr" target="#b3">(Devlin et al. 2018)</ref>. They also exhibit enhanced performance in domain-specific NLP tasks, including BioNLP tasks <ref type="bibr" target="#b8">(Lee et al. 2019;</ref><ref type="bibr" target="#b0">Alsentzer et al. 2019)</ref>. In this study, we focus on clinical trial protocols: exploring and extracting key terms (a named entity recognition task) as well as their relations (a relation extraction task) from the protocols using transformer pre-trained deep learning models. We compare several model configurations and report their results. Our NLP model achieves good performance considering the complex and unique nature of the language in real-world protocols, and has been integrated into the organization's protocol analytics practice. This approach and the extracted information will greatly facilitate trial feasibility analysis for developing new drugs.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Clinical trial protocols (often called "study protocols") contain key information specifying trial design and implementation, but are usually in an unstructured or semi-structured format, which presents a huge challenge for running computational analysis on them. Due to the protocols' critical role, drug development businesses, such as contract research organizations, have been devoting significant amounts of resources to analyzing study protocols in order to precisely understand the operational requirements, comprehensively evaluate the systemic challenges, assess the probability of success without bias, and accurately forecast the cost implications for optimal business planning. Currently, this protocol analysis work is still performed in a labor-intensive fashion, involving numerous resource-checking and cross-referencing steps. Developing safer, cheaper, and more effective drugs faster for better public health thus creates an urgent need for more efficient and effective ways to process text-based protocols.</p><p>Here, we present our efforts to facilitate the protocol analysis workflow by automating the extraction of key information from protocols using natural language processing (NLP) techniques. More specifically, we focus on the eligibility criteria section of the protocols, which contains patient selection criteria; we extract key clinically relevant entities (i.e. named entities) and entity relations (i.e. syntactic relations) from this section. Based on the extracted information, the unstructured protocols can be transformed into a structured network of interconnected key entities (e.g. condition, drug, observation) that can be fed into downstream analytic tasks, for example querying real-world evidence databases for patient population estimation, which is critical for clinical trial design in drug development.</p><p>Covance Inc. 
is the world's largest provider of clinical trial design, monitoring, management, and central lab testing services, and has accumulated a large volume of study protocols. The presented work is the first step of a bigger mission toward solving the protocol analysis challenge. To this end, we employ a transfer learning strategy and experiment with the deep learning family of algorithms, using the recently developed Bidirectional Encoder Representations from Transformers (BERT) based models and fine-tuning them on our in-house clinical trial protocol corpus to identify named entities and their relations.</p><p>Study protocols are rigorous scientific documents with highly domain-specific terms and complex relations. These characteristics bring both benefits and challenges to NLP work: we worry less about preprocessing thanks to the rigorous use of language, but must attend more to the unique yet complex clinical terms and relations. A study protocol's eligibility criteria section is usually composed of two parts, inclusion criteria and exclusion criteria, which respectively describe the unambiguous characteristics of patients to be included in and excluded from the clinical trial. The general public can access some simplified protocol texts via websites such as ClinicalTrials.gov, and these already contain many clinical terms. However, real protocols are much longer with even more domain-specific terms, making the NLP task more difficult. We employ pre-trained BERT transformers to tackle this challenging NLP task, and our study provides quantified evidence of how BERT performs in the clinical trial domain. In our practice, the extracted information is stored in a structured format. Figure <ref type="figure" target="#fig_0">1</ref> shows an example: the inclusion criteria are represented as several key-value clauses so that we can query a patient database to find the patients satisfying these criteria. 
Through extraction we are essentially connecting dots to build a larger graph for knowledge engineering purposes: we connect protocol text to patient database records, connect protocols to condition terms in a medical ontology, and so on. Once the dots are properly connected, we can perform many protocol analysis tasks, such as building a search engine for precise retrieval, composing graph networks to capture missing links, evaluating drug effectiveness by comparison with similar drugs, and clustering and recommending similar protocols for study feasibility analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Related Work</head><p>Named entity recognition (NER) and relation extraction (RE) are two classical natural language processing (NLP) tasks, which we carry out to extract entities and syntactic relations, respectively, in our study. For NER, researchers have mainly investigated probabilistic sequence labeling models such as conditional random fields (CRF), maximum entropy Markov models, and hidden Markov models <ref type="bibr" target="#b6">(Lafferty, McCallum, and Pereira 2001;</ref><ref type="bibr" target="#b12">McCallum, Freitag, and Pereira 2000;</ref><ref type="bibr" target="#b1">Bikel et al. 1998)</ref>. For RE, text classification methods, such as support vector machines, logistic regression, and the perceptron, along with feature engineering, have been used to assign relations between entities <ref type="bibr" target="#b0">(Bach and Badaskar 2007;</ref><ref type="bibr">Jurafsky 2000)</ref>.</p><p>In recent years, with advances in deep neural network methods, significant performance improvements have been achieved for the NER and RE tasks. For NER, embeddings are widely used in neural network models to represent words or characters as high-dimensional vectors. Recurrent neural networks (RNN), including LSTM, GRU, and their variants, are applied because their architectures better represent sentence context as well as the dynamic sentence lengths of natural language <ref type="bibr" target="#b5">(Huang, Xu, and Yu 2015;</ref><ref type="bibr" target="#b20">Yang, Salakhutdinov, and Cohen 2016)</ref>. The Bidirectional LSTM (Bi-LSTM) plus CRF network architecture has also been widely used to achieve better NER performance <ref type="bibr" target="#b11">(Ma and Hovy 2016;</ref><ref type="bibr">Lample et al. 2016)</ref>.</p><p>Despite the improvement over previous models, RNN and LSTM models tend to "forget" earlier context in long sequences, which limits model performance. 
Transformers were subsequently proposed to address this issue. Transformer models use an attention mechanism that attends to each word in a sequence, replacing the sequence-based RNN-style network structure with dot products and multiplications between the key/value/query matrices projected from the embedding vectors <ref type="bibr" target="#b19">(Vaswani et al. 2017)</ref>. Transformers have the advantage of attending to every token in a sequence, whether long or short, and can therefore capture associations even between tokens that are distantly separated from each other. BERT (Bidirectional Encoder Representations from Transformers), a recently popular deep learning NLP model, employs multiple layers of attention and significantly improved NLP task performance over previous models <ref type="bibr" target="#b3">(Devlin et al. 2018)</ref>.</p><p>Additionally, transfer learning aims to transfer a pre-trained model from one task to another, usually by training a general language model on a general-domain data set and transferring it to a downstream task by fine-tuning on the task-specific data set. A number of pre-trained language models have been created to facilitate downstream tasks such as NER and RE, including ELMo, ULMFiT, OpenAI GPT, and BERT, which have outperformed previous baselines, with some even achieving state-of-the-art performance <ref type="bibr" target="#b13">(Peters et al. 2018;</ref><ref type="bibr" target="#b4">Howard and Ruder 2018;</ref><ref type="bibr" target="#b14">Radford et al. 2019)</ref>.</p><p>Based on the original BERT architecture, a number of BERT variants have emerged with alterations for different purposes. 
For example, RoBERTa removes next sentence prediction from the original loss function, along with some other hyperparameter changes; Transformer-XL captures context both within and between segments to tackle long-term dependencies across sentences; and T5 advocates an encoder-decoder architecture, denoising objectives, and other changes based on extensive experiments <ref type="bibr" target="#b9">(Liu et al. 2019;</ref><ref type="bibr" target="#b2">Dai et al. 2019;</ref><ref type="bibr" target="#b15">Raffel et al. 2019)</ref>.</p><p>NER and RE have also been longstanding tasks in the biomedical NLP domain. Researchers have investigated applying similar yet more customized approaches to biomedical texts, such as CRF models and BiLSTM+CRF neural networks <ref type="bibr" target="#b7">(Leaman and Gonzalez 2008;</ref><ref type="bibr" target="#b10">Lyu et al. 2017;</ref><ref type="bibr" target="#b20">Wei et al. 2016)</ref>. With the introduction of BERT, BERT-based models have been adapted to the biomedical domain by retraining on biomedical corpora; examples include BioBERT, SciBERT, and Clinical BERT <ref type="bibr" target="#b8">(Lee et al. 2019;</ref><ref type="bibr">Beltagy, Cohan, and Lo 2019;</ref><ref type="bibr" target="#b0">Alsentzer et al. 2019)</ref>.</p><p>In the clinical informatics field, it is important to convert unstructured criteria text to a structured format, because this makes it possible to automatically parse a criterion and query a real-world evidence database for suitable patients. NER and RE algorithms are therefore an appropriate and natural fit for this practice: NER extracts concepts such as conditions and observations that are related to a patient; RE provides operational information such as the range of a particular lab test result for patient selection. Criteria2Query is a pioneering work in the space of translating study criteria to SQL queries <ref type="bibr" target="#b21">(Yuan et al. 2019)</ref>. 
It relies mainly on CRF sequence labeling for the NER task and SVM classification for relation extraction. To the best of our knowledge, there has been no research or practice using pre-trained transformer deep learning methods to extract structured information from unstructured clinical trial protocols. Motivated by the excellent performance of BERT-based models on NER and RE tasks in general domains, we develop models and evaluate their performance in the clinical trial domain.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Methodology Data Set</head><p>To facilitate our NLP approach, we selected 470 study protocols from Covance's in-house protocol database. Our protocol corpus comprises the eligibility criteria sections of these selected study protocols. An eligibility criteria section typically contains 5-20 sentences that define the criteria used to select and recruit patients for the clinical study. Our data contain a total of 30,183 criteria sentences.</p><p>Data Annotation. We have the eligibility criteria annotated using the IOB format <ref type="bibr" target="#b16">(Ramshaw and Marcus 1999)</ref>. The corpus is annotated by well-trained biomedical domain experts as the gold standard for training and testing. They manually annotate the key clinical entities and, where present, their pairwise relations. We focus on 15 types of entities and 7 types of relations that help clinically define a patient cohort:</p><p>Entities: Condition, Observation, Procedure, Device, Drug, Investigational product, Event, Refractory condition, Demographics, Measurement, Temporal constraints, Qualifier/modifier, Anatomic location, Negation cue, Permission cue</p><p>Syntactic relations: Has value, Has temporal constraint, Modified by, Located in, Is negated, Is permitted, Specified by</p><p>Data Split. For the NER task, we randomly split the 30,183 sentences into training (60%, 18,109 sentences) and test (40%, 12,074 sentences) sets. For the RE task, before splitting the data for training and testing, we first check whether a sentence contains multiple relations; if so, we duplicate the sentence for each pair of related entities and use their relation type as the label for classification. This results in 52,470 relation sample sentences, on which we perform a random split stratified by relation class to derive training (60%, 31,482 relation samples) and test (40%, 20,988 relation samples) sets. 
Tables <ref type="table" target="#tab_1">1  and 2</ref> show data statistics for the NER and RE tasks.</p></div>
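The sentence-duplication step described above for preparing RE samples can be sketched as follows. This is an illustrative snippet of our own, not the production pipeline; the function name, sample sentence, and span format (inclusive token indices) are hypothetical, while the relation labels come from the list above.

```python
# Expand a sentence with multiple annotated relations into one training
# sample per related entity pair, keeping the relation type as the label.
def expand_relation_samples(sentence_tokens, relations):
    """relations: list of (head_span, tail_span, relation_label) tuples,
    where each span is an inclusive (start_token, end_token) pair."""
    samples = []
    for head, tail, label in relations:
        # Each duplicated instance carries the same tokens but targets
        # exactly one entity pair and its relation label.
        samples.append({
            "tokens": sentence_tokens,
            "head": head,
            "tail": tail,
            "label": label,
        })
    return samples

# A sentence with two relations yields two samples.
tokens = ["Patients", "with", "severe", "asthma", "within", "6", "months"]
rels = [((3, 3), (2, 2), "Modified by"),
        ((3, 3), (4, 6), "Has temporal constraint")]
samples = expand_relation_samples(tokens, rels)
```

At prediction time the same expansion is driven by NER output instead of human annotations, enumerating candidate entity pairs.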
<div xmlns="http://www.tei-c.org/ns/1.0"><head>NER Task</head><p>As previously mentioned, we use NER algorithms to extract clinically relevant entities in the eligibility criteria section, and we particularly choose BERT, a pre-trained transformer type of deep learning model, because of its reported superior performance on many NLP tasks. Thanks to the attention transformer in BERT, it is able to provide dynamic context embeddings for tokens, which helps address the polysemy issue. BERT is a language model pre-trained on a large general-domain corpus that can be applied to downstream tasks by adding simply structured task layers and fine-tuning on a task-specific data set. We follow this fine-tuning practice based on pre-trained models to derive our NER model <ref type="bibr" target="#b3">(Devlin et al. 2018;</ref><ref type="bibr" target="#b8">Lee et al. 2019</ref>). We explore several options with regard to the choice of pre-trained models and task layers.</p><p>NER task layers. The original BERT paper indicates that, when used for NER tasks, the pre-trained BERT model can simply be followed by a softmax layer in which each token is classified into its most likely entity class, without adding any CRF layer <ref type="bibr" target="#b3">(Devlin et al. 2018</ref>). However, our experiments suggest that this approach sometimes fails to recognize contiguous phrases as whole entities. To address this issue, we further experiment with BiLSTM+CRF layers as the NER task layer, for their potentially better ability to capture bidirectional context as well as tagging likelihood at the sentence level (as opposed to the token level).</p><p>Cased or uncased. The BERT model provided by Google includes versions with and without lowercasing preprocessing on the tokens. We experiment with both the cased (not applying lowercasing) and uncased (applying lowercasing) options. 
Consequently, the two options use different subword vocabularies: the cased model has 28,996 subwords and the uncased model 30,522.</p><p>Pre-trained models. We use BERT-base, a smaller version of BERT, as well as BioBERT as the pre-trained models.</p><p>Hyperparameters. For both the BERT-base and BioBERT models, we set number of epochs = 20, learning rate = 2e-5, training batch size = 32, and max sequence length = 32. When using BiLSTM+CRF as the task layer, we set the BiLSTM layer size to 128.</p><p>The above model options result in 6 NER models: BERT-base-uncased + Softmax; BERT-base-cased + Softmax; BioBERT + Softmax; BERT-base-uncased + BiLSTM+CRF; BERT-base-cased + BiLSTM+CRF; BioBERT + BiLSTM+CRF.</p></div>
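As a rough illustration of the softmax task layer described above (a toy NumPy sketch, not our actual TensorFlow implementation; the sizes and the three-tag scheme are assumed for the example), each token's final-layer BERT vector is projected to per-tag logits and normalized independently:

```python
import numpy as np

# Toy stand-in for BERT's final-layer token vectors; in the real model H
# comes from a fine-tuned BERT/BioBERT encoder.
rng = np.random.default_rng(0)
n_tokens, hidden, n_tags = 5, 8, 3        # tags: O, B-Condition, I-Condition
H = rng.normal(size=(n_tokens, hidden))   # contextual token embeddings
W = rng.normal(size=(hidden, n_tags))     # task-layer projection
b = np.zeros(n_tags)

logits = H @ W + b                        # one tag distribution per token
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True) # row-wise softmax
pred_tags = probs.argmax(axis=1)          # greedy per-token decoding

# A BiLSTM+CRF task layer would instead score whole tag sequences,
# discouraging invalid transitions such as O -> I-Condition; the
# per-token decoding here is what allows broken-phrase errors.
```

The last comment points at the boundary-error behavior discussed above: nothing in per-token decoding ties a token's tag to its neighbors'.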
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RE Task</head><p>The RE task is also treated as a downstream task for the pre-trained models. The original BERT paper did not include the RE task among its downstream tasks, whereas the BioBERT study investigated it owing to its importance in the biomedical NLP domain <ref type="bibr" target="#b8">(Lee et al. 2019)</ref>. BioBERT handles relation extraction as a classification task at the sentence or sequence level. In particular, it assumes that each sentence contains at most one relation and classifies whether a whole sentence, instead of a particular pair of entities, contains a relation of interest, e.g. a gene-disease relation. This approach is not directly applicable to our data for two reasons: 1) our data contain multiple types of relations, and 2) in our data set, one sentence often contains multiple relations (52,470 relations / 30,183 sentences = 1.7 relations per sentence on average).</p><p>We employ the following strategy for the RE task. In training, we first scan through each sentence for entities using the human annotations and record the token positions of each entity; if a sentence contains n (n &gt; 1) pairs of entities with human-annotated relations, we duplicate this sentence n times so that each instance targets one pair of entities and their relation. In prediction, we use the NER pipeline results to locate entities, enumerate all legitimate entity pairs, and duplicate sentences accordingly. Since we record the token positions of each entity pair, we can get the BERT output vectors for them based on their position information, concatenate the two vectors, and then feed the result to a softmax layer to classify their relation. The result can be one of the 7 relations listed in Table <ref type="table" target="#tab_1">2</ref> or 'no relation'.</p><p>More specifically, the input fed to the BERT RE model is the sentence text along with the positions of the entity pair. 
We do not make use of entity type information for the following reasons: 1) this end-to-end (i.e. tokens-to-relation) practice makes the RE model more useful as a standalone tool that does not require entity types; 2) in prediction mode, errors in entity prediction could propagate to the RE task, which we mitigate by including only the entity position information. Figure <ref type="figure" target="#fig_3">3</ref> shows the neural architecture of our RE task.</p><p>For training purposes, we randomly generate negative samples for the 'no relation' class, since two entities may have no relation with each other. We obtain negative samples in two ways: one is to randomly choose two unrelated entities in a sentence; the other is to break an existing related entity pair and establish a non-related pair between one of the entities in the original pair and another unrelated entity in the sentence.</p><p>Similar to the NER task, we experiment with 3 pre-trained models, with softmax as the task layer for all of them:</p><p>• BERT-base-uncased: BERT-base pre-trained model, uncased</p><p>• BERT-base-cased: BERT-base pre-trained model, cased</p><p>• BioBERT: BioBERT pre-trained model (cased)</p><p>The following hyperparameter configuration is used: number of epochs = 20, learning rate = 2e-5, training batch size = 32, max sequence length = 32.</p></div>
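The position-based RE head described above can be sketched roughly as follows. This is a toy NumPy illustration under assumed sizes (hidden width, sentence length, and entity positions are made up), not the actual TensorFlow model: the BERT output vectors at the two entity positions are concatenated and classified into 7 relation types plus 'no relation'.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, hidden, n_classes = 12, 8, 8    # 7 relations + 'no relation'
H = rng.normal(size=(n_tokens, hidden))   # stand-in for BERT output vectors
head_pos, tail_pos = 2, 7                 # token positions of the entity pair

# Only positions are used, never entity types, matching the design above.
pair_vec = np.concatenate([H[head_pos], H[tail_pos]])   # shape (2*hidden,)

W = rng.normal(size=(2 * hidden, n_classes))            # softmax task layer
logits = pair_vec @ W
probs = np.exp(logits - logits.max())
probs /= probs.sum()
predicted_class = int(probs.argmax())     # index into 7 relations + none
```

Because classification depends only on the two position-indexed vectors, the same sentence can be scored repeatedly, once per candidate entity pair.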
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Results and Analysis</head><p>We implement the NER and RE tasks in TensorFlow based on the BERT neural architecture and run experiments on an AWS p2.xlarge GPU instance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>NER Results</head><p>We follow the practice of the SemEval-2013 Drug-Drug Interactions task and evaluate NER performance by 3 matching standards: strict, exact, and partial (Segura-Bedmar, Martínez, and Herrero-Zazo 2013). Strict matching evaluates both the boundary and the entity type of entity phrases; exact matching evaluates the exact boundary regardless of entity type; and partial matching measures the partial boundary of entities regardless of entity type (thus the most lenient). We calculate precision (P), recall (R), and f1-score (F) for the three evaluation types, and additionally we report macro-averaged P/R/F results. The results are shown in Table <ref type="table" target="#tab_2">3</ref>.</p><p>In our experiments, fine-tuning the pre-trained BioBERT model achieves slightly better performance than its BERT counterparts. For example, BioBERT + Softmax has an f1-score of 70.61, better than BERT-base-uncased + Softmax's 69.80 and BERT-base-cased + Softmax's 69.68. Similarly, BioBERT + BiLSTM+CRF holds a higher f1-score than BERT-base-uncased + BiLSTM+CRF and BERT-base-cased + BiLSTM+CRF for all four evaluation types.</p><p>When comparing the cased and uncased strategies, we notice that the uncased pre-trained models outperform the cased ones with the same neural architecture: e.g. BERT-base-uncased + BiLSTM+CRF achieves an f1-score of 70.28 for the strict evaluation type, higher than the 69.89 of BERT-base-cased + BiLSTM+CRF. This finding suggests that lowercasing during preprocessing actually enhances performance slightly, which is counter-intuitive for NER tasks as entities are often case-sensitive. Meanwhile, we also find that the two BioBERT models, which are cased, perform better than their peer models with the same neural architecture. 
However, since BioBERT only offers a cased option, we cannot discern the relative contribution of casing in the BioBERT pre-trained model.</p><p>From Table <ref type="table" target="#tab_2">3</ref>, it is not surprising that for a given model, the partial evaluation usually holds the highest score, followed by exact, strict, and macro. Another observation is that when we loosen the evaluation type from strict to exact, i.e. focusing on entity boundaries without penalizing entity type errors, performance improves but still remains in the 73.15-74.06 range, suggesting that the experimented BERT-based models fail to identify entity boundaries very precisely, which can be of interest for future investigation.</p><p>In our experiments with simple Softmax as the task layer, we observe more boundary detection errors. This was in fact our motivation for adding the BiLSTM+CRF layers as the NER task layer. However, the results show that, given the same pre-trained model configuration, whether BiLSTM+CRF consistently improves performance is debatable. For example, BioBERT + BiLSTM+CRF slightly outperforms BioBERT + Softmax in strict matching precision and f1-score, but BioBERT + Softmax beats BioBERT + BiLSTM+CRF in strict matching recall.</p><p>We also find that the recall score is consistently higher than the precision score for all models at all evaluation standards, indicating that the models tend to make more false positive predictions than false negative predictions. The macro scores are lower than strict/exact/partial because macro averaging simply averages the performance of different entity types, and some small-sample entity types have lower performance due to a lack of training data.</p><p>Overall, BioBERT + BiLSTM+CRF produces the best precision and f1-scores for all four evaluation types, whereas BioBERT + Softmax holds the highest recalls. 
These results suggest that fine-tuning BioBERT lends itself better to NER tasks in the clinical trial domain, which seems intuitive. As for the task layer, the choice between Softmax and BiLSTM+CRF does not significantly affect performance.</p></div>
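The strict and exact matching standards used above can be illustrated with a small sketch of our own (a simplified illustration, not the SemEval scorer; the partial standard, which also credits overlapping boundaries, is omitted for brevity):

```python
def iob_spans(tags):
    """Decode IOB tags into (start, end, type) spans, end exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):       # sentinel flushes last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((start, i, etype))
                start = None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
        # "I-" tags simply continue the current span
    return spans

def prf(gold, pred, strict=True):
    """Strict compares (boundary, type); exact compares boundary only."""
    key = (lambda s: s) if strict else (lambda s: s[:2])
    g, p = {key(s) for s in gold}, {key(s) for s in pred}
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# Correct boundaries but one wrong entity type: exact is perfect,
# strict is penalized.
gold = iob_spans(["B-Drug", "I-Drug", "O", "B-Condition"])
pred = iob_spans(["B-Drug", "I-Drug", "O", "B-Observation"])
```

Here `prf(gold, pred, strict=True)` yields 0.5 across P/R/F while the exact variant yields 1.0, mirroring the strict-versus-exact gap reported in Table 3.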
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RE Results</head><p>RE evaluation results are shown in Table <ref type="table" target="#tab_3">4</ref>, in which we report micro, macro, and weighted precision (P), recall (R), and f1-score (F). From these results, we find that BERT-base-uncased has the highest f1-scores, whereas BERT-base-cased has the lowest. Comparing BERT-base-cased and BioBERT indicates that BioBERT helps performance slightly, at least in this cased scenario. On the other hand, BERT-base-uncased noticeably improves over its cased peer, BERT-base-cased, by a 4.33 percentage point margin. Therefore, just like the NER task, the RE task also benefits from lowercasing, probably because the uncased setting reduces vocabulary variation in processing. We also observe that recall and precision are close to each other, with precision slightly higher for the macro evaluation but, on the contrary, recall slightly higher for micro and weighted. These observations suggest that the model has a higher precision score than recall score in classes with fewer samples, such as 'is located' and 'is negated' (in Table <ref type="table" target="#tab_0">1</ref>), and when doing macro evaluation, the contribution of the smaller classes becomes more visible.</p><p>Overall, the BERT-base-uncased model prevails: it outperforms the other two models on every evaluation type and measure. For example, it has an f1-score of 78.79 for micro, compared to BERT-base-cased's 74.46 and BioBERT's 74.60. These results indicate again that lowercasing preprocessing helps NLP tasks even in the clinical trial domain, where many terms are written in capital letters. Secondly, BioBERT beating BERT-base-cased by a small margin may suggest that although pre-training in the biomedical domain can bring some benefit, it is still not specific enough for clinical trials. 
Since no uncased BioBERT pre-trained model is available, it is unclear whether training on a biomedical corpus with lowercasing preprocessing could synergistically improve performance. Considering the big improvement from BERT-base-cased to BERT-base-uncased, we believe an uncased variant of the current BioBERT model is worth future investigation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Error Analysis</head><p>We present and inspect NER prediction results from one of the models (BERT-base-uncased + Softmax) in Brat, an open-source tool that helps visualize annotation results using color bars <ref type="bibr" target="#b18">(Stenetorp et al. 2012)</ref>. We overlay human and prediction annotations together in Brat to facilitate the comparison.</p><p>The NER errors can be broadly categorized into boundary errors and entity type errors, as reflected by the four evaluation types. For boundary errors, one pattern is that BERT tends to mis-annotate some words inside a multi-word phrase. For example, as shown in Figure <ref type="figure" target="#fig_4">4</ref>, "at least a 3 month" is one temporal constraint entity, but the NER model only captures "at" + "3 month" while missing the words in the middle ("least a"). This reflects a potential problem with BERT NER models: although they can assign entity classes relatively well, the lack of structure enforcement on the output layer can cause inconsistent labels within a full phrase. In some cases, the NER model captures longer entities than the human annotator. For example, the model annotates "[cardiac mechanical assist device]|Device", whereas the gold standard annotates the same phrase as "[cardiac]|AnatomicLocation" + "[mechanical assist device]|Device". In other cases, the situation reverses, and the NER model chunks one gold-standard entity into multiple ones. For example, "[non-steroidal anti-inflammatory drugs]|Drug" is chunked into a Qualifier/Modifier and a drug: "[non-steroidal]|Qualifier/Modifier [anti-inflammatory drugs]|Drug". 
The boundary merging and chunking issues, as illustrated by these two examples, occur frequently with the Qualifier/Modifier class, as it is arguable whether a complex term should be annotated as one whole entity or as a Qualifier/Modifier plus an entity.</p><p>For entity type errors, we observe a few cases, such as "urinalysis" (Procedure type) being predicted as an Observation entity, and "gastrointestinal motility" (Condition type) being predicted as a Drug. Type errors occur less frequently than boundary errors according to our manual inspection.</p><p>For the RE task, we manually screen the predictions from the BERT-base-uncased + Softmax model against the gold standard. We first observe that NER boundary errors can propagate to the RE task. Note that we only use named entity positions, not types, in the RE task, and therefore only NER boundary errors can affect RE performance. For example, "Transient neurologic deficits", annotated as one Condition entity in the gold standard, is split into "Transient" (Qualifier/Modifier) and "neurologic deficits" (Condition), causing the RE task to predict a 'modified by' relation between the two entities that does not actually exist in the gold standard. Another major category of RE classification error is that a number of actual relations are misclassified as 'no relation', while misclassification between the other classes is much less frequent.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusion and Future Work</head><p>In this study, we focus on extracting clinically relevant terms and relations from protocol eligibility criteria by applying pre-trained transformer deep learning NLP models to the NER and RE tasks. We experiment with several configurations of the pre-trained BERT models and report our results and findings.</p><p>Our results demonstrate the effectiveness of NLP models in processing clinical trial protocols. Although the processed texts are highly domain-specific, with specialized clinical and medical terms and logical relations, the BERT and BioBERT models achieved acceptable performance. We also find that, in general, BioBERT, which is pre-trained on a biomedical corpus, outperforms BERT, which is pre-trained on a general-domain corpus. This agrees with the general understanding that domain-specific pre-training is important for achieving higher performance on domain-specific tasks.</p><p>A surprising finding is that even though the clinical trial domain contains many capitalized terminologies, lowercasing during preprocessing improves the performance of both the NER and RE tasks. Our hypothesis is that reducing token variation (lowercasing yields fewer distinct token forms) matters more for these tasks than preserving case information.</p><p>It is also worth noting that there is room to improve the quality of our gold standard. Because the protocols cover many different sub-domains in the biomedical and clinical sciences, spanning multiple therapeutic areas, even human experts can easily make mistakes or annotate inconsistently. In fact, we found many cases where the model predictions are in fact correct, despite differing from the gold standard. To address this annotation quality issue, we employed an iterative annotation pipeline in which human experts verify documents pre-annotated by the NLP models. 
We anticipate that this practice can partly address the issue.</p><p>We believe that model performance can be further improved, and we plan to explore several directions. The first is to train a biomedical BERT model from scratch using a domain-specific vocabulary. The BERT model handles tokens by splitting them into subwords using a predefined subword vocabulary. For example, 'myocarditis' and 'pericarditis', two heart conditions sharing the same suffix 'carditis', are represented as 'my'+'##oca'+'##rdi'+'##tis' and 'per'+'##ica'+'##rdi'+'##tis' respectively. This tokenization does not represent the suffix in a biomedically meaningful way, because the vocabulary lacks biomedical subwords. We hypothesize that subwords derived from the biomedical domain, reflecting word-root patterns, can enhance the word representations of BERT models and thus improve downstream task performance. To this end, we can train a BERT model from scratch using a biomedical corpus and a biomedical subword vocabulary.</p><p>The second strategy is multi-task co-training: since the NER and RE tasks depend on each other (knowing one task's output can facilitate the other), joint learning is expected to improve performance on both.</p><p>Our third strategy is to reduce the unnecessary relations currently predicted by the RE model. Our current greedy prediction pipeline enumerates all possible entity pairs, which results in an unnecessarily large set of candidate pairs. One way to address this issue is to use dependency parsing information, which can indicate whether two terms have a dependency relation, to prune unnecessary entity pairs. The information extracted by the NER and RE tasks has great potential to assist the drug development business, especially study feasibility analysis. 
The derived information forms the basis of a local knowledge graph for the protocols, and of a global graph when merged with external structured information such as drug ontologies. In conclusion, this is our first step in a broader mission to apply deep learning to business cases in drug development, and subsequent analysis based on the derived graphs can further extend our contributions and insights in this research area.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Structured information extracted from protocol eligibility criteria.</figDesc><graphic coords="2,73.63,54.00,199.24,107.78" type="bitmap" /></figure>
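The subword issue motivating the first future-work direction can be illustrated with a minimal sketch of BERT-style WordPiece tokenization (greedy longest-match-first). The two vocabularies below are toy examples of ours, not BERT's actual vocabulary, but they reproduce the 'myocarditis'/'pericarditis' splits discussed above and show how a biomedical vocabulary containing the root '##carditis' would keep it intact:

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece tokenization (BERT-style).

    Repeatedly take the longest prefix of the remaining characters that
    appears in the vocabulary; non-initial pieces carry a '##' prefix.
    If some remainder matches nothing, the whole word becomes [UNK].
    """
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]  # no subword covers this position
        pieces.append(piece)
        start = end
    return pieces

# Toy general-domain vocabulary: the shared root 'carditis' gets shredded.
general = {"my", "##oca", "##rdi", "##tis", "per", "##ica"}
# Toy biomedical vocabulary that keeps the root as one piece.
biomed = {"myo", "peri", "##carditis"}

print(wordpiece("myocarditis", general))   # ['my', '##oca', '##rdi', '##tis']
print(wordpiece("pericarditis", general))  # ['per', '##ica', '##rdi', '##tis']
print(wordpiece("pericarditis", biomed))   # ['peri', '##carditis']
```

With the biomedical vocabulary, both heart conditions share the '##carditis' piece, so its learned embedding is shared across them, which is the intuition behind training with a domain-specific subword vocabulary.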
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Neural architecture of the BERT NER task (with Softmax as the task layer).</figDesc><graphic coords="4,54.00,54.00,206.45,230.27" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>• BERT-base (uncased), Softmax: BERT-base uncased pre-trained model, softmax as NER task layer • BERT-base (cased), Softmax: BERT-base cased pre-trained model, softmax as NER task layer • BioBERT, Softmax: BioBERT pre-trained model (cased), softmax as NER task layer • BERT-base (uncased), BiLSTM+CRF: BERT-base pre-trained uncased model, BiLSTM+CRF as NER task layer • BERT-base (cased), BiLSTM+CRF: BERT-base pre-trained cased model, BiLSTM+CRF as NER task layer • BioBERT, BiLSTM+CRF: BioBERT pre-trained model (cased), BiLSTM+CRF as NER task layer The layout of the BERT NER neural architecture is shown in Figure 2.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Neural architecture of the BERT RE task (with Softmax as the task layer).</figDesc><graphic coords="4,319.50,54.00,233.39,224.96" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: An example of the NER engine mis-annotating tokens within a phrase.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Train and test data counts for the NER task.</figDesc><table><row><cell>Entity</cell><cell>Train</cell><cell>Test</cell></row><row><cell>Condition</cell><cell cols="2">12,682 8,537</cell></row><row><cell>Observation</cell><cell>7,309</cell><cell>5,218</cell></row><row><cell>Procedure</cell><cell>3,406</cell><cell>2,234</cell></row><row><cell>Device</cell><cell>221</cell><cell>140</cell></row><row><cell>Drug</cell><cell>7,793</cell><cell>5,858</cell></row><row><cell cols="2">Investigational product 329</cell><cell>224</cell></row><row><cell>Event</cell><cell>2,430</cell><cell>1,625</cell></row><row><cell>Refractory condition</cell><cell>381</cell><cell>278</cell></row><row><cell>Demographics</cell><cell>498</cell><cell>381</cell></row><row><cell>Measurement</cell><cell>4,540</cell><cell>3,344</cell></row><row><cell>Temporal constraints</cell><cell>6,968</cell><cell>4,589</cell></row><row><cell>Qualifier/modifier</cell><cell>7,853</cell><cell>5,196</cell></row><row><cell>Anatomic location</cell><cell>427</cell><cell>223</cell></row><row><cell>Negation cue</cell><cell>921</cell><cell>615</cell></row><row><cell>Permission cue</cell><cell>1,236</cell><cell>869</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Train and test data counts for the RE task.</figDesc><table><row><cell>Relation</cell><cell>Train</cell><cell>Test</cell></row><row><cell>is negated</cell><cell>703</cell><cell>468</cell></row><row><cell>is permitted</cell><cell>1,009</cell><cell>673</cell></row><row><cell>modified by</cell><cell>5,715</cell><cell>3,810</cell></row><row><cell>has value</cell><cell>3,326</cell><cell>2,218</cell></row><row><cell cols="2">has temporal constraint 6,169</cell><cell>4,112</cell></row><row><cell>is located</cell><cell>215</cell><cell>143</cell></row><row><cell>specified by</cell><cell>3,729</cell><cell>2,486</cell></row><row><cell>no relation</cell><cell cols="2">10,616 7,078</cell></row><row><cell>total count</cell><cell cols="2">31,482 20,988</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>NER task results: Precision (P), Recall (R), F1 score (F).</figDesc><table><row><cell>NER Model</cell><cell>Type</cell><cell>P</cell><cell>R</cell><cell>F</cell></row><row><cell></cell><cell>strict</cell><cell cols="3">67.76 71.98 69.80</cell></row><row><cell>BERT-base (uncased),</cell><cell>exact</cell><cell cols="3">71.02 75.44 73.16</cell></row><row><cell>Softmax</cell><cell cols="4">partial 75.28 79.96 77.55</cell></row><row><cell></cell><cell cols="4">macro 62.65 66.83 64.63</cell></row><row><cell></cell><cell>strict</cell><cell cols="3">67.82 71.66 69.68</cell></row><row><cell>BERT-base (cased),</cell><cell>exact</cell><cell cols="3">71.19 75.22 73.15</cell></row><row><cell>Softmax</cell><cell cols="4">partial 75.41 79.68 77.49</cell></row><row><cell></cell><cell cols="4">macro 63.04 66.37 64.63</cell></row><row><cell></cell><cell>strict</cell><cell cols="3">68.73 72.60 70.61</cell></row><row><cell>BioBERT,</cell><cell>exact</cell><cell cols="3">71.87 75.91 73.83</cell></row><row><cell>Softmax</cell><cell cols="4">partial 75.99 80.26 78.06</cell></row><row><cell></cell><cell cols="4">macro 62.97 67.27 65.03</cell></row><row><cell></cell><cell>strict</cell><cell cols="3">68.59 72.06 70.28</cell></row><row><cell>BERT-base (uncased),</cell><cell>exact</cell><cell cols="3">71.85 75.49 73.62</cell></row><row><cell>BiLSTM+CRF</cell><cell cols="4">partial 76.10 79.95 77.98</cell></row><row><cell></cell><cell cols="4">macro 63.43 66.45 64.88</cell></row><row><cell></cell><cell>strict</cell><cell cols="3">68.09 71.80 69.89</cell></row><row><cell>BERT-base (cased),</cell><cell>exact</cell><cell cols="3">71.34 75.22 73.23</cell></row><row><cell>BiLSTM+CRF</cell><cell cols="4">partial 75.55 79.67 77.56</cell></row><row><cell></cell><cell cols="4">macro 62.68 66.41 64.45</cell></row><row><cell></cell><cell>strict</cell><cell cols="3">69.12 72.47 70.76</cell></row><row><cell>BioBERT,</cell><cell>exact</cell><cell cols="3">72.35 75.85 74.06</cell></row><row><cell>BiLSTM+CRF</cell><cell cols="4">partial 76.55 80.25 78.36</cell></row><row><cell></cell><cell cols="4">macro 63.79 67.44 65.54</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4 :</head><label>4</label><figDesc>RE task results: Precision (P), Recall (R), F1 score (F).</figDesc><table><row><cell>RE Model</cell><cell>Type</cell><cell>P</cell><cell>R</cell><cell>F</cell></row><row><cell></cell><cell>micro</cell><cell cols="3">78.10 79.49 78.79</cell></row><row><cell>BERT-base (uncased)</cell><cell>macro</cell><cell cols="3">76.43 76.22 76.24</cell></row><row><cell></cell><cell cols="4">weighted 78.03 79.49 78.72</cell></row><row><cell></cell><cell>micro</cell><cell cols="3">73.61 75.33 74.46</cell></row><row><cell>BERT-base (cased)</cell><cell>macro</cell><cell cols="3">69.56 68.63 68.80</cell></row><row><cell></cell><cell cols="4">weighted 73.41 75.33 74.27</cell></row><row><cell></cell><cell>micro</cell><cell cols="3">74.37 74.83 74.60</cell></row><row><cell>BioBERT</cell><cell>macro</cell><cell cols="3">70.30 68.34 69.08</cell></row><row><cell></cell><cell cols="4">weighted 74.17 74.83 74.44</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Alsentzer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Murphy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Boag</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-H</forename><surname>Weng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Naumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>McDermott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Badaskar</surname></persName>
		</author>
		<author>
			<persName><surname>Beltagy</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1904.03323</idno>
		<idno>arXiv:1903.10676</idno>
		<title level="m">SciBERT: Pretrained contextualized embeddings for scientific text</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Lo</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2007">2019. 2007. 2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
	<note>Literature review for Language and Statistics II</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Nymble: a high-performance learning name-finder</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Bikel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Schwartz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Weischedel</surname></persName>
		</author>
		<idno>arXiv preprint cmp-lg/9803003</idno>
		<imprint>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">W</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carbonell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1901.02860</idno>
		<title level="m">Transformer-XL: Attentive language models beyond a fixed-length context</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Universal language model fine-tuning for text classification</title>
		<author>
			<persName><forename type="first">J</forename><surname>Howard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ruder</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1801.06146</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Bidirectional LSTM-CRF models for sequence tagging</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1508.01991</idno>
	</analytic>
	<monogr>
		<title level="m">Speech &amp; language processing</title>
				<imprint>
			<publisher>Pearson Education India</publisher>
			<date type="published" when="2000">2015. 2000</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Conditional random fields: Probabilistic models for segmenting and labeling sequence data</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lafferty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>McCallum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">C</forename><surname>Pereira</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1603.01360</idno>
	</analytic>
	<monogr>
		<title level="m">Neural architectures for named entity recognition</title>
				<imprint>
			<date type="published" when="2001">2001. 2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
	<note>ICML proceedings</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">BANNER: an executable survey of advances in biomedical named entity recognition</title>
		<author>
			<persName><forename type="first">R</forename><surname>Leaman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gonzalez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Biocomputing</title>
				<imprint>
			<date type="published" when="2008">2008. 2008</date>
			<biblScope unit="page" from="652" to="663" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">BioBERT: a pre-trained biomedical language representation model for biomedical text mining</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yoon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">H</forename><surname>So</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kang</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">Bioinformatics</note>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">RoBERTa: A robustly optimized BERT pretraining approach</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1907.11692</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Long short-term memory RNN for biomedical named entity recognition</title>
		<author>
			<persName><forename type="first">C</forename><surname>Lyu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ji</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC bioinformatics</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">462</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hovy</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1603.01354</idno>
		<title level="m">End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Maximum entropy markov models for information extraction and segmentation</title>
		<author>
			<persName><forename type="first">A</forename><surname>McCallum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Freitag</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">C</forename><surname>Pereira</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICML</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page" from="591" to="598" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Peters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Neumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Iyyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gardner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1802.05365</idno>
		<title level="m">Deep contextualized word representations</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Language models are unsupervised multitask learners</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Luan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">OpenAI Blog</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">8</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Matena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Liu</surname></persName>
		</author>
		<idno>arXiv e-prints</idno>
		<title level="m">Exploring the limits of transfer learning with a unified text-to-text transformer</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Text chunking using transformation-based learning</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">A</forename><surname>Ramshaw</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">P</forename><surname>Marcus</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Natural language processing using very large corpora</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="157" to="176" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">SemEval-2013 Task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013)</title>
		<author>
			<persName><forename type="first">I</forename><surname>Segura-Bedmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Martínez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Herrero-Zazo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">brat: a web-based tool for NLP-assisted text annotation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Stenetorp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pyysalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Topić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ohta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ananiadou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tsujii</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics</title>
				<meeting>the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics<address><addrLine>Avignon, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="102" to="107" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ł</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="5998" to="6008" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Disease named entity recognition by combining conditional random fields and bidirectional recurrent neural networks</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Cohen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1603.06270</idno>
	</analytic>
	<monogr>
		<title level="m">Multitask cross-lingual sequence tagging from scratch</title>
				<imprint>
			<date type="published" when="2016">2016. 2016. 2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Criteria2Query: a natural language interface to clinical databases for cohort definition</title>
		<author>
			<persName><forename type="first">C</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">B</forename><surname>Ryan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hardin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Makadia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Medical Informatics Association</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="294" to="305" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
