<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Using Embedding-based Metrics to expedite patients recruitment process for clinical trials</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Houssein</forename><surname>Dhayne</surname></persName>
							<email>houssein.dhayne@net.usj.edu.lb</email>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Engineering</orgName>
								<orgName type="institution">ESIB Saint Joseph University Beirut</orgName>
								<address>
									<country key="LB">Lebanon</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Rima</forename><surname>Kilany</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Faculty of Engineering</orgName>
								<orgName type="institution">ESIB Saint Joseph University Beirut</orgName>
								<address>
									<country key="LB">Lebanon</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Using Embedding-based Metrics to expedite patients recruitment process for clinical trials</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">DD852BD938111A146C0205777FA73C4F</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T21:36+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>NLP</term>
					<term>NLI</term>
					<term>EMR</term>
					<term>Automated clinical trial eligibility screening</term>
					<term>BioBERT</term>
					<term>Sentence similarity</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Despite the unprecedented volumes of Electronic Medical Records (EMRs) generated daily across healthcare facilities, the potential of these data to support patient participation in clinical trials remains largely unfulfilled. The reason is that matching patient information to the eligibility criteria of clinical trials is a manual, labor-intensive process. Automating this process is therefore an essential step toward increasing the number of patients participating in clinical research. To address this issue, we propose a novel framework for automatically matching patients to clinical trials. The matching process is based on measuring the similarity between phrases extracted from patient medical records and the eligibility criteria of a trial.</p><p>Our solution combines classical NLP techniques with modern deep learning-based NLP models. We follow pre-training and transfer learning approaches to help the model learn task-specific reasoning skills, and we perform supervised fine-tuning on the Medical Natural Language Inference (MedNLI) and Semantic Textual Similarity Benchmark (STS-B) datasets. Matching is performed at the semantic phrase level by converting patient information and trial criteria into vector representations. A scoring function that combines cosine similarity with scaling normalization then identifies potential patient-trial matches. The experimental results show that our framework is effective at ranking patients by their similarity scores.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>The widespread adoption and use of electronic medical records (EMRs), together with the development of advanced artificial intelligence models, offer remarkable opportunities for improving the clinical research sector <ref type="bibr" target="#b0">[1]</ref>. EMRs have a wide range of potential uses in clinical trials, such as facilitating feasibility assessment and patient recruitment, and providing key patient health information and medical history prior to the screening visit. The latter is a critical step in reducing the cost and duration of clinical trials <ref type="bibr" target="#b1">[2]</ref>. Additionally, linking EMRs with clinical trials has been shown to increase the patient recruitment rate <ref type="bibr" target="#b2">[3]</ref>. However, many barriers must be overcome before EMRs can be used for clinical trials.</p><p>Even though EMRs were designed to record information in a structured format, such as procedure information, diagnosis codes, drug prescriptions, and lab results, free text remains the most flexible way for physicians to express case nuances and clinical reasoning <ref type="bibr" target="#b3">[4]</ref>. These free texts usually contain important facts about patients, but they are rarely available for formal queries <ref type="bibr" target="#b4">[5]</ref>.</p><p>On the other hand, the eligibility criteria of a clinical trial describe the characteristics of patients who are qualified to participate in the trial. Each criterion is usually expressed as descriptive text and specified as either an inclusion or an exclusion criterion. 
Therefore, free-text criteria cannot always be transformed into structured data representations.</p><p>The authors of <ref type="bibr" target="#b5">[6]</ref> confirmed that structured EMR data alone is insufficient for resolving eligibility criteria for patient recruitment in clinical trials, and that unstructured data is essential to resolve 59% to 77% of the trial criteria.</p><p>However, matching clinical notes with eligibility criteria is still performed manually, which makes it expensive in terms of time and effort. This slows down clinical trials and may delay new drugs from benefiting patients. As a consequence, it might entail the loss of human lives that could otherwise have benefited from new medication. For these reasons, automated matching of clinical notes with eligibility criteria in the eligibility screening workflow would help overcome the bottlenecks of pre-screening practices in a trial setting.</p><p>To tackle this challenge efficiently, the matching process must operate at the semantic sentence level rather than merely checking for the presence or absence of a lexical criterion. Our investigation of modern deep learning-based natural language processing (NLP) models led us to propose a framework that automates the evaluation of patients' eligibility for a relevant clinical trial. First, the framework splits patient clinical reports and clinical trial sentences into comparatively basic phrase units. Second, it classifies the phrases into clinical categories (diagnosis, drug, procedure, observation). Third, it converts candidate phrases into vector representations using an appropriate deep learning-based NLP model. 
Finally, it calculates a semantic matching score between each patient and a clinical trial by combining cosine similarity with a scaling normalization method.</p><p>This paper is organized as follows. In Section II, we present the problem definition and review related work. In Section III, we describe our framework and the challenges it addresses. The evaluation of the results is discussed in Section IV. Finally, we conclude the paper in Section V.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. BACKGROUND</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Problem definition</head><p>According to our approach, the problem of patient-trial matching can be defined as follows:</p><p>Finding clinical trial participants is the task of matching a patient $P_i$ ($P_i \in EMR$), represented by a Discharge Summary $DS_i$, to a Clinical Trial $CT$, represented by its Eligibility Criteria $EC$. Formally, the solution to this task is to find the top-$K$ highest values of a function $M$ that computes the matching score $M(P_i, CT) = v$, where $v$ is the score of matching patient $P_i$ to $CT$. This list of the top-$K$ highest scores reduces the overall number of patients that clinicians need to screen in order to identify eligible patients.</p></div>
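<div xmlns="http://www.tei-c.org/ns/1.0"><p>The top-$K$ selection described above can be sketched as follows. This is a minimal illustration, assuming that the matching scores have already been computed; the function name top_k_patients and the patient identifiers and scores are hypothetical and not taken from the study.</p>
<p>
```python
import heapq


def top_k_patients(scores, k):
    """Return the k (patient_id, score) pairs with the highest matching score."""
    # scores maps a patient identifier to its matching score M(P_i, CT).
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical scores, for illustration only.
scores = {"P1": 6.5, "P2": -3.0, "P3": 8.0, "P4": 1.2}
shortlist = top_k_patients(scores, k=2)
print(shortlist)  # [('P3', 8.0), ('P1', 6.5)]
```
</p>
<p>Only the shortlisted patients would then be passed to clinicians for manual screening.</p></div>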
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Data representation</head><p>1) Clinical trial: A clinical trial is a type of research that provides a longstanding foundation in the practice of medicine and the evaluation of new medical treatments. Each trial has eligibility criteria describing the characteristics of its target population: a participant must meet all inclusion criteria and none of the exclusion criteria. These criteria differ from study to study. The authors of <ref type="bibr" target="#b6">[7]</ref> analysed 1000 eligibility criteria and showed that 23% of the criteria are simple, or can be reduced to simple criteria, and that 77% remain complex to evaluate. Therefore, a formally computable representation of eligibility criteria requires natural language processing techniques as part of automated screening for patient eligibility.</p><p>2) Patient medical records: An EMR typically collects various types of patient information, including discharge summaries, prior diagnoses, radiology reports, medication history, and so on. Hospital discharge summaries are a physician-authored synopsis of a patient's hospital stay, and serve as the main documents communicating a patient's care plan to the post-hospital care team <ref type="bibr" target="#b7">[8]</ref>. Discharge summaries are organized in several sections, which usually include past medical history and history of present illness, as shown in Fig. <ref type="figure" target="#fig_0">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Related work</head><p>In the recent past, several projects have developed tools and technologies for automated trial-patient matching. Milian et al. <ref type="bibr" target="#b8">[9]</ref> used a template-based formalism to extract and represent the semantics of the trial criteria in order to improve their comparability. Patel et al. <ref type="bibr" target="#b9">[10]</ref> formulated the matching process as a semantic retrieval problem by expressing each clinical trial criterion as a semantic query, which a reasoner can then use with a formal medical ontology (SNOMED CT) to retrieve eligible patients. Other works such as EliIE <ref type="bibr" target="#b10">[11]</ref> and Criteria2Query <ref type="bibr" target="#b11">[12]</ref> have focused on identifying standardized medical entities in eligibility criteria using machine learning approaches; the extracted entities are then used to query patient data. Shivade et al. <ref type="bibr" target="#b12">[13]</ref> constructed an annotated dataset indicating whether a medical note contains text that meets a criterion. They then implemented two lexical methods and two semantic methods to determine a relevance score between each sentence and a criterion statement, and found that the semantic methods gave better results than the lexical ones. Ni et al. <ref type="bibr" target="#b13">[14]</ref> evaluated a system that combines NLP, information retrieval, and machine learning methods to identify a cohort of patients for clinical trial eligibility pre-screening. Their system relies on both structured data and clinical notes from EMRs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. FRAMEWORK OVERVIEW</head><p>In this section, we describe the framework we propose for automating the matching process between patients and a clinical trial. The framework addresses the following challenges: (i) To handle complex sentences in patient data as well as in clinical trials, we break paragraphs down into sentences, and complex sentences are then parsed into phrases. These phrases are the basic units for matching. (ii) To avoid costly comparisons without false dismissals, phrases are partitioned using classification methods, which limits the number of pairs to match. (iii) To match phrases, we represent them as distributed vectors, which makes it possible to calculate similarity between phrases that differ in form but are semantically related. Fig. <ref type="figure" target="#fig_1">2</ref> shows an overview of our Patients to Clinical Trial matching framework. Given a Clinical Trial $CT$ and a set of Patients $P$, our task is to calculate a matching score $M(P_i, CT)$.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Paragraph and sentence decomposition</head><p>In order to measure the similarity between two sentences, we have to deal with simple sentences, each representing a linguistically meaningful unit. This requires segmenting both paragraph-level and sentence-level structures into phrase-level structures. According to <ref type="bibr" target="#b14">[15]</ref>, segmentation of paragraphs and sentences is the process of determining the longer processing units, each consisting of one or more words, that are passed to further processing stages such as part-of-speech parsers, morphological analyzers, etc.</p><p>In our model, we handle each phrase as a primitive semantic unit and find matching phrases between patient and clinical trials by calculating the similarity of each phrase in the discharge summary to each phrase in the Eligibility Criteria (EC).</p><p>We used the paragraph and sentence segmentation of MetaMap <ref type="bibr" target="#b15">[16]</ref>. MetaMap was provided by the National Library of Medicine (NLM) to map Medical Language Processor (MLP) text to UMLS Metathesaurus concepts <ref type="bibr" target="#b16">[17]</ref>. MetaMap breaks text into paragraphs, then sentences, and then phrases. Table I presents a simple example of segmenting sentences into phrases. The first example comes from the eligibility criteria of a clinical trial (NCT03484780) and the second from a patient discharge summary.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Phrases classification</head><p>A discharge summary report contains information about different topics. Therefore, the large number of heterogeneous phrases extracted from the patient reports may affect the efficiency and effectiveness of pairwise phrase matching <ref type="bibr" target="#b17">[18]</ref>.</p><p>To minimize the number of required comparisons, we applied a filtering methodology. For each eligibility criterion, it filters out all phrases whose class does not correspond to the class of the criterion, which limits the number of pairs to match.</p><p>Data classification techniques support this filtering by separating the phrases extracted from patient data and from the clinical trial into different medical categories. This classification filters out non-matching pairs prior to verification, which increases the efficiency of phrase similarity matching with high precision and without sacrificing recall.</p><p>In our study, a total of 1500 eligibility criteria were extracted from the Clinical Trials database (https://clinicaltrials.gov/) and were manually labelled by a certified nurse and a master's student in data science according to four classes (diagnosis, drug, procedure, observation).</p><p>We empirically explored and compared four widely used classification methods as baselines, namely SVM, CNN, LSTM, and C-LSTM <ref type="bibr" target="#b18">[19]</ref>, in order to identify the one with the best performance. For the SVM and CNN models, we initialized the sentence representation with the average of the word embeddings of all words in the sentence, using PubMed-and-PMC-w2v <ref type="bibr" target="#b19">[20]</ref>.</p><p>Our experiment indicates that the CNN + w2v model has the best prediction performance among the models selected in our exploration, with a Precision of 0.87, a Recall of 0.88, and an F1-score of 0.875. 
We therefore adopted CNN + PubMed-and-PMC-w2v to perform this classification task and were able to categorize the phrases into the four aforementioned categories.</p></div>
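<div xmlns="http://www.tei-c.org/ns/1.0"><p>The effect of class-based filtering on the number of comparisons can be sketched as follows. This is an illustrative sketch only: it assumes that every phrase already carries a class label produced by the classifier, and the function name same_class_pairs and the sample phrases are hypothetical.</p>
<p>
```python
from collections import defaultdict


def same_class_pairs(patient_phrases, criteria_phrases):
    """Yield (patient_phrase, criterion_phrase) pairs that share a class label."""
    # Index the criteria by class so each patient phrase only meets its own class.
    by_class = defaultdict(list)
    for text, label in criteria_phrases:
        by_class[label].append(text)
    for text, label in patient_phrases:
        for criterion in by_class.get(label, []):
            yield text, criterion


# Hypothetical labelled phrases, for illustration only.
patient_phrases = [
    ("History of coronary artery disease", "diagnosis"),
    ("aspirin 81 mg daily", "drug"),
]
criteria_phrases = [
    ("heart failure with preserved ejection fraction", "diagnosis"),
    ("prior coronary bypass surgery", "procedure"),
]
pairs = list(same_class_pairs(patient_phrases, criteria_phrases))
print(len(pairs))  # 1 pair instead of the 4 produced by an all-against-all comparison
```
</p>
<p>In this toy case, only the two diagnosis phrases are compared; the drug and procedure phrases have no counterpart of the same class and generate no pair.</p></div>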
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Phrase vector representations</head><p>The purpose of this work is to match patient data with clinical trials by comparing unstructured data from both datasets. Our claim is that by measuring the similarity of primitive semantic medical units (medical phrases) of a patient's Discharge Summary and of the Eligibility Criteria, we can generate a score that supports the matching task.</p><p>Many measures of semantic similarity between sentences are used in NLP. Both unsupervised and supervised methods have been used to calculate the semantic similarity between two sentences in the biomedical domain <ref type="bibr" target="#b20">[21]</ref>. Recently, a number of novel approaches have been proposed to address this problem by producing sentence vectors <ref type="bibr" target="#b21">[22]</ref>. For example, neural sentence-embedding methods <ref type="bibr" target="#b22">[23]</ref> have been shown to outperform traditional approaches such as TF-IDF and word-overlap-based measures.</p><p>1) Universal sentence embeddings: The concept of universal sentence embeddings has grown in popularity as it leverages models trained on large text corpora. These pretrained models can be used in a wide range of downstream tasks, for instance as versatile sentence-embedding models that convert sentences into vector representations. Notable works include ELMo <ref type="bibr" target="#b23">[24]</ref>, GPT <ref type="bibr" target="#b24">[25]</ref>, and BERT <ref type="bibr" target="#b25">[26]</ref>.</p><p>2) BioBERT: BERT (Bidirectional Encoder Representations from Transformers) is a neural network language model trained on plain text for masked word prediction and next sentence prediction tasks. BERT applies a multi-layer bidirectional transformer encoder with self-attention. 
According to <ref type="bibr" target="#b26">[27]</ref>, BERT achieved state-of-the-art performance in many Natural Language Processing tasks and was significantly better than other models. Among more recent models, XLNet <ref type="bibr" target="#b27">[28]</ref> outperforms BERT on the GLUE benchmark <ref type="bibr" target="#b28">[29]</ref>, but is not yet widely used in the medical field. Applying the same architecture as BERT, Lee et al. <ref type="bibr" target="#b29">[30]</ref> proposed the BioBERT language model, trained on biomedical corpora including PubMed and PMC. The BioBERT model showed promising results in the biomedical domain.</p><p>3) Phrase embedding: To generate context-rich phrase embeddings, we chose BioBERT as the language model in conjunction with the Bert-as-service library <ref type="bibr" target="#b30">[31]</ref>. Bert-as-service is a feature extraction service based on BERT that offers two strategies for deriving a fixed-size vector. The default strategy applies average pooling over all tokens of the second-to-last hidden layer, while the second strategy uses the output of the special CLS token and is recommended only after fine-tuning BERT on a downstream task.</p></div>
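<div xmlns="http://www.tei-c.org/ns/1.0"><p>The average pooling strategy mentioned above can be illustrated with a small sketch. It does not call Bert-as-service or BioBERT; it assumes the per-token vectors of one hidden layer are already available as plain lists, and the toy three-dimensional vectors and the name mean_pool are hypothetical.</p>
<p>
```python
def mean_pool(token_vectors):
    """Average equal-length token vectors into one fixed-size phrase vector."""
    if not token_vectors:
        raise ValueError("at least one token vector is required")
    count = len(token_vectors)
    # zip(*...) groups the values of each dimension across all tokens.
    return [sum(dimension) / count for dimension in zip(*token_vectors)]


# Toy token vectors standing in for hidden states of a two-token phrase.
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(mean_pool(tokens))  # [2.0, 3.0, 4.0]
```
</p>
<p>Whatever the number of tokens in a phrase, the pooled vector has the dimensionality of a single token vector, which is what makes phrases of different lengths comparable.</p></div>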
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Phrases Similarity Measures</head><p>The similarity between two vectors can be evaluated using various similarity measures such as cosine similarity and Euclidean distance.</p><formula xml:id="formula_0">\text{if } \cos(\mathbf{A}, \mathbf{B}) &gt; \cos(\mathbf{A}, \mathbf{C})<label>(2)</label></formula><p>then $\mathbf{A}$ is more similar to $\mathbf{B}$ than to $\mathbf{C}$.</p><p>Although the knowledge in a pre-trained BioBERT often yields good performance on certain tasks, this prior knowledge is not sufficient to compute the similarity of sentences from their embeddings. We first computed the cosine similarity of sentences annotated by experts, using embeddings extracted from pre-trained BioBERT without any fine-tuning. The results were unsatisfactory (Table <ref type="table" target="#tab_2">II</ref>): in some cases the sentence ranked most similar was the exact opposite of the reference sentence. For example, the sentence most similar to "History of CVA" was "patient has normal brain MRI", with a similarity value of 0.91, although the experts annotated this pair as "Contradiction", while the "Entailment" sentence "patient has history of stroke" came second with a similarity value of 0.89. These experiments reinforced our belief that it is necessary to fine-tune BioBERT on our downstream task.</p><p>1) Supervised Fine-tuning: Transfer learning is the process of extending a pre-trained model by leveraging data from an additional domain for better model generalization <ref type="bibr" target="#b31">[32]</ref>. The most common transfer learning technique in NLP is fine-tuning. Fine-tuning involves copying the weights from a pre-trained network and tuning them using labeled data from the downstream tasks. 
BERT is a fine-tuning-based representation model that achieves state-of-the-art performance on a large suite of sentence-level tasks, with pre-trained representations reducing the need for many heavily engineered task-specific architectures.</p><p>In natural language understanding (NLU), the relationship between two sentences is assessed through several downstream tasks such as Natural Language Inference (NLI) and Semantic Textual Similarity (STS) <ref type="bibr" target="#b28">[29]</ref>. Moreover, the authors of <ref type="bibr" target="#b32">[33]</ref> have shown that fine-tuning BERT on NLI and STS datasets creates sentence embeddings that achieve an improvement of 11.7 points compared to InferSent <ref type="bibr" target="#b33">[34]</ref> and 5.5 points compared to the Universal Sentence Encoder <ref type="bibr" target="#b21">[22]</ref>. In this context, we first fine-tuned BioBERT on the STS-B dataset, which produced our BioBERT-based model. We then further fine-tuned it on the MedNLI dataset. We used the fine-tuning classifier from BERT systems <ref type="bibr" target="#b34">[35]</ref>.</p><p>• MedNLI <ref type="bibr" target="#b35">[36]</ref>: a large, publicly available, expert-annotated dataset drawn from the medical history section of MIMIC-III. MedNLI includes a set of clinical sentence pairs (14,049 pairs), each annotated with one of three classes: entailment, contradiction, or neutral.</p><p>• STS-B <ref type="bibr" target="#b36">[37]</ref>: a collection of sentence pairs selected from news headlines. The dataset consists of paired sentences (8,628 pairs) labelled by humans with a similarity score from 0 to 5 denoting how similar the two sentences are in terms of semantic meaning.</p><p>2) Evaluation of fine-tuned BioBERT: We evaluated the new BioBERT model by computing the cosine similarity between the phrase embeddings. 
We observed that the model was not only able to rank phrases in terms of similarity, but also gave more appropriate cosine values. A representative sample of the results is shown in Table <ref type="table" target="#tab_2">II</ref>.</p></div>
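<div xmlns="http://www.tei-c.org/ns/1.0"><p>The comparison expressed by equation (2) can be made concrete with a short sketch that computes cosine similarity and ranks candidate phrases against a query phrase. The two-dimensional vectors are toy values chosen for readability, not BioBERT embeddings, and the function names are hypothetical.</p>
<p>
```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Sort candidate names by decreasing cosine similarity to the query."""
    scored = [(name, cosine(query, vector)) for name, vector in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors: B points in nearly the same direction as A, C is orthogonal to A.
A = [1.0, 0.0]
ranking = rank_by_similarity(A, {"B": [1.0, 1.0], "C": [0.0, 1.0]})
print([name for name, _ in ranking])  # ['B', 'C']
```
</p>
<p>Here cos(A, B) is about 0.71 and cos(A, C) is 0, so A is more similar to B than to C in the sense of equation (2).</p></div>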
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Matching Patients to Clinical Trials</head><p>After fine-tuning the BioBERT model for optimized cosine similarity and creating phrase embeddings for both the Discharge Summaries and the Clinical Trial, we proceeded to find Clinical Trial participants in an EMR dataset.</p><p>Formally, we denote:</p><p>• $DS_i = \{ph_{i,1}, ph_{i,2}, \ldots, ph_{i,r}\}$ as the phrases extracted from the Discharge Summary of patient $P_i$.</p><p>• $IEC = \{iec_1, iec_2, \ldots, iec_p\}$ as the phrases extracted from the Inclusion Eligibility Criteria.</p><p>• $EEC = \{eec_1, eec_2, \ldots, eec_q\}$ as the phrases extracted from the Exclusion Eligibility Criteria.</p><p>• The set of all phrases extracted from the Eligibility Criteria, with $l = p + q$:</p><formula xml:id="formula_1">EC = \{ec_1, ec_2, \ldots, ec_l\} = IEC \cup EEC</formula><p>• The cosine similarity matrix, where $n$ and $l$ are the numbers of patients and of EC elements, respectively:</p><formula xml:id="formula_2">\mathbf{S} \in [0, 1]^{n \times l}</formula><p>1) Matching Patient to Eligibility Criteria: Once the phrase embeddings are computed for the patients and for the clinical trial eligibility criteria, we calculate the similarity between phrases of the same class (diagnosis, drug, procedure, observation) as defined in subsection III-B. An element $s_{i,j}$ of $\mathbf{S}$ represents the similarity between patient $P_i$ and a single eligibility criterion $ec_j$. It is obtained by calculating the cosine between each phrase $ph_{i,r}$ extracted from $DS_i$ and $ec_j$; only the highest cosine value is retained for $s_{i,j}$ and all other values are discarded.</p><formula xml:id="formula_3">s_{i,j} = \max_{\forall ph_{i,r} \in DS_i} \cos(ph_{i,r}, ec_j)<label>(3)</label></formula><formula xml:id="formula_4">i \in [1, n], \quad j \in [1, l]</formula><p>Once the similarity values are obtained, the final representation of $\mathbf{S}$ is as follows:</p><formula xml:id="formula_5">\mathbf{S} = \begin{bmatrix} s_{1,1} &amp; \cdots &amp; s_{1,l} \\ \vdots &amp; \ddots &amp; \vdots \\ s_{n,1} &amp; \cdots &amp; s_{n,l} \end{bmatrix}</formula><p>2) Ranking and Scoring Patients: The semantic cosine similarity calculated in the previous paragraph yields a graded similarity rather than an exact semantic match. 
Consequently, similarity values obtained for different features (eligibility criteria) in the matrix $\mathbf{S}$ are not directly comparable: a higher value in one column does not mean a greater similarity with the patient than a lower value in another column. For example, if $s_{x,1}$ and $s_{y,2}$ are the highest values for the features $ec_1$ and $ec_2$, respectively, and if $s_{x,1} &gt; s_{y,2}$, this does not mean that $P_x$ has a phrase more similar to $ec_1$ than $P_y$ has to $ec_2$, because the comparison in equation (2) is only meaningful against a common reference phrase. It only means that $P_x$ and $P_y$ are ranked at the top of the similarity lists for $ec_1$ and $ec_2$, respectively. The same logic applies to the lowest value, which corresponds to the last position in the similarity ranking. This variation in similarity values between features requires a range normalization step that replaces cosine similarity with rank similarity, which is well suited to computing a matching score between patients and the Clinical Trial. To this end, we generated a new matrix $\mathbf{R}$ by applying the following feature scaling normalization:</p><formula xml:id="formula_7">r_{i,j} = \begin{cases} n \times \dfrac{s_{i,j} - \min_{\forall i}(s_{i,j})}{\max_{\forall i}(s_{i,j}) - \min_{\forall i}(s_{i,j})}, &amp; ec_j \in IEC \\ (-n) \times \dfrac{s_{i,j} - \min_{\forall i}(s_{i,j})}{\max_{\forall i}(s_{i,j}) - \min_{\forall i}(s_{i,j})}, &amp; ec_j \in EEC \end{cases}<label>(4)</label></formula><p>Finally, the matching score $M$ of Patient $P_i$ with a Clinical Trial is determined by:</p><formula xml:id="formula_8">M(P_i, CT) = \sum_{j=1}^{l} r_{i,j}<label>(5)</label></formula></div>
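<div xmlns="http://www.tei-c.org/ns/1.0"><p>The scoring steps of this subsection (best cosine per criterion, column-wise scaling with a negative sign for exclusion criteria, then a row sum) can be sketched as follows. This is a minimal pure-Python illustration, not the implementation used in the study; the function names and the toy similarity matrix are hypothetical, and treating a constant column as contributing 0 is an assumption, since the text does not say how a zero range is handled.</p>
<p>
```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def similarity_row(patient_vectors, criteria_vectors):
    """One row of S: for each criterion, the best cosine over all patient phrases."""
    return [max(cosine(ph, ec) for ph in patient_vectors) for ec in criteria_vectors]


def matching_scores(S, is_inclusion):
    """Scale every column of S to [0, n], negate exclusion columns, sum each row."""
    n = len(S)
    scores = [0.0] * n
    for j, inclusion in enumerate(is_inclusion):
        column = [row[j] for row in S]
        low, high = min(column), max(column)
        span = high - low
        sign = 1.0 if inclusion else -1.0
        for i in range(n):
            # Assumption: a constant column carries no ranking signal and adds 0.
            scaled = (S[i][j] - low) / span if span > 0 else 0.0
            scores[i] += sign * n * scaled
    return scores


# Toy matrix: 3 patients, criterion 0 is an inclusion, criterion 1 an exclusion.
S = [[0.9, 0.5], [0.7, 0.8], [0.8, 0.6]]
M = matching_scores(S, [True, False])
print([round(m, 2) for m in M])  # [3.0, -3.0, 0.5]
```
</p>
<p>In this toy case the first patient best matches the inclusion criterion and least matches the exclusion criterion, so it receives the highest score, while the second patient is in the opposite situation and receives the lowest.</p></div>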
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. EVALUATION</head><p>To validate our framework, we used two datasets: MIMIC-III (Medical Information Mart for Intensive Care) <ref type="bibr" target="#b37">[38]</ref>, comprising information relating to patients admitted to critical care units, and Clinical Trials<ref type="foot" target="#foot_0">2</ref>, a web-based resource providing access to information on supported clinical studies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Text processing</head><p>The MIMIC-III Clinical Dataset is a critical care database that contains 2,083,108 medical reports from 46,520 patients. We experimented with a randomly selected set of 100 Discharge Summaries from patients' last visits, excluding patients under 18 years of age. The segmentation stage produces an average of 400 phrases per report.</p><p>We selected a clinical trial that investigates the role of aldosterone antagonists in patients with heart failure with preserved ejection fraction (NCT04078425). Fig. <ref type="figure" target="#fig_3">3</ref> shows the five eligibility criteria of this clinical trial.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Evaluation of the obtained results</head><p>Table <ref type="table" target="#tab_3">III</ref> presents the results for a sample of ten patients. In order to evaluate the clinical correctness of matching patients to the clinical trial (NCT04078425), a validation task was performed manually by a nurse and a computer science student. Notably, the evaluation of the matching revealed no false positives in the score results: the similarity scores reflect the order of matching between patients and the clinical trial. The scores ranged from -15 to 8, and the patients retained for further screening by experts were those with a score greater than 5.</p><p>We should note that the scores would be more realistic if the segmentation process were more accurate. For instance, the sentence "you were thought to have a blood clot in your right leg" was segmented by MetaMap into "a blood clot in your right leg", which loses the hedging context and would result in a false outcome.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. CONCLUSION</head><p>EMRs contain a large portion of unstructured data that needs to be matched with eligibility criteria for trial-patient enrollment. The gradual improvement of artificial intelligence technology could reduce the number of physician-hours spent screening patients for eligibility. To tackle this problem, we proposed a framework designed to automatically recommend the most suitable patients for a clinical trial. The framework adopts a pre-trained language model (BioBERT) and uses the STS-B and MedNLI datasets to improve the accuracy of the model via transfer learning. This work showed that fine-tuning BioBERT yields better performance when calculating the similarity between two medical sentences using embedding-based metrics. In future work, we will also explore the structured tables of EMRs in order to improve the performance and accuracy of our trial-patient matching framework.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. An example of discharge summary contents and format.</figDesc><graphic coords="2,66.65,88.43,205.99,82.03" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Framework overview</figDesc></figure>
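<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>\mathbf{S} = \begin{bmatrix} \max_{ph_{1,r}} \cos(ph_{1,r}, ec_1) &amp; \cdots &amp; \max_{ph_{1,r}} \cos(ph_{1,r}, ec_l) \\ \vdots &amp; \ddots &amp; \vdots \\ \max_{ph_{n,r}} \cos(ph_{n,r}, ec_1) &amp; \cdots &amp; \max_{ph_{n,r}} \cos(ph_{n,r}, ec_l) \end{bmatrix}</figDesc></figure>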
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. The eligibility criteria specified in the NCT04078425 clinical trial</figDesc><graphic coords="6,79.08,88.43,181.13,106.31" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>TABLE I EXAMPLE</head><label>I</label><figDesc>OF SENTENCES SEGMENTATION INTO PHRASES</figDesc><table><row><cell>Paragraph</cell><cell>Phrases</cell></row><row><cell cols="2">Eligibility Criteria NCT03484780</cell></row><row><cell>Previous open laparotomy</cell><cell>1-Previous open laparotomy</cell></row><row><cell>or contraindications to</cell><cell>2-contraindications to laparoscopy</cell></row><row><cell>laparoscopy, as determined by</cell><cell>3-determined by implanting</cell></row><row><cell>implanting physician.</cell><cell>physician</cell></row><row><cell cols="2">Discharge Summary</cell></row><row><cell>History of paroxysmal atrial fibrillation with anticoagulation in the past. History of coronary artery disease status post myocardial infarction</cell><cell>1-History of paroxysmal atrial fibrillation 2-with anticoagulation in the past. 3-History of coronary artery disease 4-status post myocardial infarction</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>TABLE II</head><label>II</label><figDesc>NLI AND COS SIMILARITY BEFORE AND AFTER FINE-TUNING OF BIOBERT</figDesc><table><row><cell></cell><cell></cell><cell>Experts</cell><cell cols="2">Pre-trained BioBERT</cell><cell cols="2">Fine-tuned BioBERT</cell></row><row><cell>Phrase 1 (P1)</cell><cell>Phrase 2 (P2)</cell><cell>NLI(P1, P2)</cell><cell>Cos(P1, P2)</cell><cell>Rank</cell><cell>Cos(P1, P2)</cell><cell>Rank</cell></row><row><cell>History of CVA</cell><cell>patient has history of stroke</cell><cell>Entailment</cell><cell>0.89</cell><cell>1.53</cell><cell>0.87</cell><cell>3.00</cell></row><row><cell></cell><cell>patient has normal brain mri</cell><cell>Contradiction</cell><cell>0.91</cell><cell>3.00</cell><cell>0.75</cell><cell>0.00</cell></row><row><cell></cell><cell>patient is hemiplegic</cell><cell>Neutral</cell><cell>0.86</cell><cell>0.00</cell><cell>0.77</cell><cell>0.38</cell></row><row><cell>Per report ECG with initial qtc of 410 now 475, QRS 82 initially, now 86 rate= 95.</cell><cell>Patient has abnormal EKG findings.</cell><cell>Entailment</cell><cell>0.89</cell><cell>2.05</cell><cell>0.82</cell><cell>3.00</cell></row><row><cell></cell><cell>Patient has normal EKG.</cell><cell>Contradiction</cell><cell>0.90</cell><cell>3.00</cell><cell>0.80</cell><cell>2.30</cell></row><row><cell></cell><cell>Patient has angina.</cell><cell>Neutral</cell><cell>0.88</cell><cell>0.00</cell><cell>0.73</cell><cell>0.00</cell></row><row><cell></cell><cell>the patient was in a MVC.</cell><cell>Entailment</cell><cell>0.89</cell><cell>3.00</cell><cell>0.82</cell><cell>3.00</cell></row><row><cell></cell><cell>the patient has no medical history.</cell><cell>Contradiction</cell><cell>0.88</cell><cell>2.48</cell><cell>0.53</cell><cell>0.00</cell></row><row><cell></cell><cell>the patient has no significant injuries.</cell><cell>Neutral</cell><cell>0.86</cell><cell>0.00</cell><cell>0.67</cell><cell>1.51</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>TABLE III</head><label>III</label><figDesc>RANKS AND SCORES OF MATCHING 10 PATIENTS WITH 6 ELIGIBILITY CRITERIA (NCT04078425)</figDesc><table><row><cell></cell><cell>iec1</cell><cell>iec2</cell><cell>eec1</cell><cell>eec2</cell><cell>eec3</cell><cell>eec4</cell><cell>Score</cell></row><row><cell>P-1</cell><cell>9.46</cell><cell>9.84</cell><cell>-8.47</cell><cell>-2.74</cell><cell>-1.02</cell><cell>-1.02</cell><cell>6.03</cell></row><row><cell>P-2</cell><cell>5.02</cell><cell>7.44</cell><cell>-3.43</cell><cell>-3.12</cell><cell>-8.42</cell><cell>-8.42</cell><cell>-10.93</cell></row><row><cell>P-3</cell><cell>9.08</cell><cell>8.65</cell><cell>-10.00</cell><cell>-2.38</cell><cell>-10.00</cell><cell>-10.00</cell><cell>-14.65</cell></row><row><cell>P-4</cell><cell>0.00</cell><cell>4.09</cell><cell>-2.96</cell><cell>-5.76</cell><cell>-5.24</cell><cell>-5.24</cell><cell>-15.12</cell></row><row><cell>P-5</cell><cell>3.43</cell><cell>4.02</cell><cell>-6.26</cell><cell>-2.69</cell><cell>-1.09</cell><cell>-1.09</cell><cell>-3.69</cell></row><row><cell>P-6</cell><cell>5.19</cell><cell>0.00</cell><cell>0.00</cell><cell>-1.42</cell><cell>0.00</cell><cell>0.00</cell><cell>3.77</cell></row><row><cell>P-7</cell><cell>5.65</cell><cell>2.95</cell><cell>-3.86</cell><cell>-2.72</cell><cell>-0.15</cell><cell>-0.15</cell><cell>1.72</cell></row><row><cell>P-8</cell><cell>7.26</cell><cell>10.00</cell><cell>-7.52</cell><cell>-5.76</cell><cell>-6.98</cell><cell>-6.98</cell><cell>-9.99</cell></row><row><cell>P-9</cell><cell>6.43</cell><cell>9.14</cell><cell>-4.44</cell><cell>-10.00</cell><cell>-2.70</cell><cell>-2.70</cell><cell>-4.27</cell></row><row><cell>P-10</cell><cell>10.00</cell><cell>7.44</cell><cell>-6.27</cell><cell>0.00</cell><cell>-10.00</cell><cell>-10.00</cell><cell>-8.83</cell></row></table></figure>
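<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a reading aid for Table III, the printed Score values appear consistent, up to small rounding differences, with a plain sum of the six per-criterion ranks in each row (inclusion ranks positive, exclusion ranks negative). This is an observation from the printed numbers and an assumption of the sketch below, not a formula stated in this section; the names TABLE_III and aggregate are hypothetical.</p>
<p>
```python
# Rows copied from Table III: six per-criterion ranks, then the printed Score.
TABLE_III = {
    "P-1": (9.46, 9.84, -8.47, -2.74, -1.02, -1.02, 6.03),
    "P-2": (5.02, 7.44, -3.43, -3.12, -8.42, -8.42, -10.93),
    "P-3": (9.08, 8.65, -10.00, -2.38, -10.00, -10.00, -14.65),
    "P-4": (0.00, 4.09, -2.96, -5.76, -5.24, -5.24, -15.12),
    "P-5": (3.43, 4.02, -6.26, -2.69, -1.09, -1.09, -3.69),
    "P-6": (5.19, 0.00, 0.00, -1.42, 0.00, 0.00, 3.77),
    "P-7": (5.65, 2.95, -3.86, -2.72, -0.15, -0.15, 1.72),
    "P-8": (7.26, 10.00, -7.52, -5.76, -6.98, -6.98, -9.99),
    "P-9": (6.43, 9.14, -4.44, -10.00, -2.70, -2.70, -4.27),
    "P-10": (10.00, 7.44, -6.27, 0.00, -10.00, -10.00, -8.83),
}


def aggregate(ranks):
    """Assumed aggregation: plain sum of the six per-criterion ranks."""
    return sum(ranks)


recomputed = {pid: aggregate(row[:6]) for pid, row in TABLE_III.items()}

# Largest absolute gap between a recomputed sum and the printed Score.
max_gap = max(abs(recomputed[pid] - row[6]) for pid, row in TABLE_III.items())

# Patients ordered from most to least suitable under this assumption.
ranking = sorted(recomputed, key=recomputed.get, reverse=True)

print("ranking:", ranking)
print("largest gap to printed Score:", round(max_gap, 2))
```
</p>
<p>Under this assumption a patient who matches the inclusion criteria strongly can still fall down the ranking when the exclusion criteria also match strongly, as with P-3 and P-10 in the table.</p></div>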
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">https://clinicaltrials.gov/</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENT</head><p>The authors would like to thank Marvin Moughabghab for his efforts and contributions to this work.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">In search of big medical data integration solutions-a comprehensive survey</title>
		<author>
			<persName><forename type="first">H</forename><surname>Dhayne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Haque</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kilany</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Taher</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page">91265</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Using electronic health records for clinical research: the case of the EHR4CR project</title>
		<author>
			<persName><forename type="first">G</forename><surname>De Moor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sundgren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kalra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dugas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Claerhout</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Karakoyun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ohmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-Y</forename><surname>Lastic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ammour</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of biomedical informatics</title>
		<imprint>
			<biblScope unit="volume">53</biblScope>
			<biblScope unit="page" from="162" to="173" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Routine data from hospital information systems can support patient recruitment for clinical studies</title>
		<author>
			<persName><forename type="first">M</forename><surname>Dugas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lange</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Müller-Tidow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kirchhof</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-U</forename><surname>Prokosch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Clinical Trials</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="183" to="189" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Data from clinical notes: a perspective on the tension between structure and flexible documentation</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Rosenbloom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Denny</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lorenzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">W</forename><surname>Stead</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">B</forename><surname>Johnson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Medical Informatics Association</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="181" to="186" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Sedie: A semantic-driven engine for integration of healthcare data</title>
		<author>
			<persName><forename type="first">H</forename><surname>Dhayne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kilany</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Haque</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Taher</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Bioinformatics and Biomedicine (BIBM)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="617" to="622" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">How essential are unstructured clinical narratives and information fusion to clinical trial recruitment?</title>
		<author>
			<persName><forename type="first">P</forename><surname>Raghavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Fosler-Lussier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Lai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">AMIA Summits on Translational Science Proceedings</title>
		<imprint>
			<biblScope unit="volume">2014</biblScope>
			<biblScope unit="page">218</biblScope>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A practical method for transforming free-text eligibility criteria into computable criteria</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">W</forename><surname>Tu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Peleg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Carini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bobak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Rubin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of biomedical informatics</title>
		<imprint>
			<biblScope unit="volume">44</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="239" to="250" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Deficits in communication and information transfer between hospital-based and primary care physicians: implications for patient safety and continuity of care</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kripalani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Lefevre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">O</forename><surname>Phillips</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">V</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Basaviah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Baker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">JAMA</title>
		<imprint>
			<biblScope unit="volume">297</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="831" to="841" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Enhancing reuse of structured eligibility criteria and supporting their relaxation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Milian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hoekstra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bucur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Teije</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Van Harmelen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Paulissen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of biomedical informatics</title>
		<imprint>
			<biblScope unit="volume">56</biblScope>
			<biblScope unit="page" from="205" to="219" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Matching patient records to clinical trials using ontologies</title>
		<author>
			<persName><forename type="first">C</forename><surname>Patel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cimino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dolby</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fokoue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kalyanpur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kershenbaum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Schonberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Srinivas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="816" to="829" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">EliIE: An open-source information extraction system for clinical trial eligibility criteria</title>
		<author>
			<persName><forename type="first">T</forename><surname>Kang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">W</forename><surname>Hruby</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rusanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Elhadad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Weng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Medical Informatics Association</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="1062" to="1071" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Criteria2Query: a natural language interface to clinical databases for cohort definition</title>
		<author>
			<persName><forename type="first">C</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">B</forename><surname>Ryan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hardin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Makadia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Medical Informatics Association</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="294" to="305" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Textual inference for eligibility criteria resolution in clinical trials</title>
		<author>
			<persName><forename type="first">C</forename><surname>Shivade</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hebert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lopetegui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-C</forename><surname>De Marneffe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Fosler-Lussier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Lai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of biomedical informatics</title>
		<imprint>
			<biblScope unit="volume">58</biblScope>
			<biblScope unit="page" from="S211" to="S218" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Automated clinical trial eligibility prescreening: increasing the efficiency of patient identification for clinical trials in the emergency department</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Ni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kennebeck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Dexheimer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>McAneney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lingren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Solti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Medical Informatics Association</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="166" to="178" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Tokenisation and sentence segmentation</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Palmer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Handbook of natural language processing</title>
				<imprint>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="11" to="35" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">R</forename><surname>Aronson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AMIA Symposium</title>
				<meeting>the AMIA Symposium</meeting>
		<imprint>
			<publisher>American Medical Informatics Association</publisher>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page">17</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">An overview of MetaMap: historical perspective and recent advances</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">R</forename><surname>Aronson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F.-M</forename><surname>Lang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Medical Informatics Association</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="229" to="236" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">A blocking framework for entity resolution in highly heterogeneous information spaces</title>
		<author>
			<persName><forename type="first">G</forename><surname>Papadakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ioannou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Palpanas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Niederee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Nejdl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">12</biblScope>
			<biblScope unit="page" from="2665" to="2682" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">A C-LSTM neural network for text classification</title>
		<author>
			<persName><forename type="first">C</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Lau</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1511.08630</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Distributional semantics resources for biomedical text processing</title>
		<author>
			<persName><forename type="first">S</forename><surname>Moen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">S S</forename><surname>Ananiadou</surname></persName>
		</author>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">BIOSSES: semantic sentence similarity estimation system for the biomedical domain</title>
		<author>
			<persName><forename type="first">G</forename><surname>Sogancıoglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Öztürk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Özgür</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="issue">14</biblScope>
			<biblScope unit="page" from="49" to="58" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Universal sentence encoder</title>
		<author>
			<persName><forename type="first">D</forename><surname>Cer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-Y</forename><surname>Kong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Hua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Limtiaco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>St. John</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Constant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Guajardo-Cespedes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Tar</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1803.11175</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">BioSentVec: creating sentence embeddings for biomedical texts</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.09302</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Peters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Neumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Iyyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gardner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1802.05365</idno>
		<title level="m">Deep contextualized word representations</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">Improving language understanding by generative pre-training</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Narasimhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Salimans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<ptr target="https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Testing the generalization power of neural network models across NLI benchmarks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Talman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chatzikyriakidis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</title>
				<meeting>the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="85" to="94" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<title level="m" type="main">XLNet: Generalized autoregressive pretraining for language understanding</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carbonell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1906.08237</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<title level="m" type="main">GLUE: A multi-task benchmark and analysis platform for natural language understanding</title>
		<author>
			<persName><forename type="first">A</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Michael</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R</forename><surname>Bowman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1804.07461</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yoon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">H</forename><surname>So</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1901.08746</idno>
		<title level="m">BioBERT: pre-trained biomedical language representation model for biomedical text mining</title>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Xiao</surname></persName>
		</author>
		<ptr target="https://github.com/hanxiao/bert-as-service" />
		<title level="m">bert-as-service</title>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Transfer learning in natural language processing</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ruder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Peters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Swayamdipta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials</title>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="15" to="18" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<title level="m" type="main">Sentence-BERT: Sentence embeddings using Siamese BERT-networks</title>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1908.10084</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<title level="m" type="main">Supervised learning of universal sentence representations from natural language inference data</title>
		<author>
			<persName><forename type="first">A</forename><surname>Conneau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kiela</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schwenk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Barrault</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bordes</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1705.02364</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<title level="m" type="main">google-research/bert: TensorFlow code and pre-trained models for BERT</title>
		<ptr target="https://github.com/google-research/bert" />
		<imprint>
			<date type="published" when="2019">09/17/2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<title level="m" type="main">Lessons from natural language inference in the clinical domain</title>
		<author>
			<persName><forename type="first">A</forename><surname>Romanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Shivade</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1808.06752</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Cer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Diab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Agirre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Lopez-Gazpio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1708.00055</idno>
		<title level="m">SemEval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation</title>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">MIMIC-III, a freely accessible critical care database</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">E</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">J</forename><surname>Pollard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">H</forename><surname>Lehman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ghassemi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Moody</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Szolovits</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">A</forename><surname>Celi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">G</forename><surname>Mark</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Scientific data</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page">160035</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
