Using Embedding-based Metrics to Expedite the Patient Recruitment Process for Clinical Trials

1st Houssein Dhayne, Faculty of Engineering, ESIB, Saint Joseph University, Beirut, Lebanon. houssein.dhayne@net.usj.edu.lb
2nd Rima Kilany, Faculty of Engineering, ESIB, Saint Joseph University, Beirut, Lebanon. rima.kilany@usj.edu.lb

Abstract—Despite the unprecedented volumes of Electronic Medical Records (EMRs) generated daily across healthcare facilities, the ability to leverage these data for patient participation in clinical trials remains largely unfulfilled. The reason is that matching patient information to the eligibility criteria of clinical trials is a manual, effort-consuming process. Automating this process is therefore an essential step toward increasing the number of patients participating in clinical research. To address this issue, we propose a novel framework for automated patient-to-clinical-trial matching. The matching process is based on measuring a similarity score between phrases extracted from patient medical records and the eligibility criteria of a trial. Our solution combines classical NLP techniques with modern deep learning-based NLP models. In this context, we follow pre-training and transfer learning approaches to help the model learn task-specific reasoning skills, and we perform supervised fine-tuning on the Medical Natural Language Inference (MedNLI) and Semantic Textual Similarity (STS-B) datasets. The matching is performed at the level of semantic phrases by converting patient information and trial criteria into vector representations; a scoring function combining cosine similarity and scaling normalization then identifies potential patient-trial matches. Experimental results show that our framework is highly effective at ranking patients by their similarity scores.

Index Terms—NLP, NLI, EMR, Automated clinical trial eligibility screening, BioBERT, Sentence similarity

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
I. INTRODUCTION

The widespread adoption and use of electronic medical records (EMRs), together with the development of advanced artificial intelligence models, offer remarkable opportunities for improving the clinical research sector [1]. Furthermore, EMRs offer a wide range of potential uses in clinical trials, such as facilitating clinical trial feasibility assessment and patient recruitment, as well as obtaining key patient health information and medical history prior to the screening visit. The latter is a critical step in reducing the costs and duration of clinical trials [2]. Additionally, linking EMRs with clinical trials has been shown to increase patient recruitment rates [3]. However, many barriers must be overcome before EMRs can be used for clinical trials.

Even though EMRs were designed to record information in a structured format, such as procedure information, diagnosis codes, drug prescriptions, and lab results, free text remains the most flexible way for physicians to express case nuances and clinical reasoning [4]. These free texts usually contain important facts about patients, but they are rarely available for formal queries [5].

On the other hand, the eligibility criteria of a clinical trial describe the characteristics of patients who are qualified to participate in the trial. Each criterion is usually expressed as descriptive text and specified in the form of inclusion and exclusion criteria. Therefore, free-text criteria cannot always be transformed into structured data representations.

Authors in [6] confirmed that using only structured data from the EMR is insufficient for resolving eligibility criteria in patient recruitment for clinical trials, and that unstructured data is essential to resolve 59% to 77% of trial criteria. However, matching clinical notes with eligibility criteria is still performed manually, which makes it an expensive process in terms of time and effort. This slows down clinical trials and may delay new drugs from benefiting patients; as a consequence, it might entail the loss of human lives that would otherwise have benefited from new medication. For these reasons, automated matching of clinical notes with eligibility criteria in the eligibility screening workflow would help overcome the bottlenecks of pre-screening practices in a trial setting.

To tackle this challenge efficiently, we need to execute the matching process at a semantic sentence level, rather than just checking for the presence or absence of a lexical criterion. Investigating the potential of modern deep learning-based NLP (Natural Language Processing) models led us to propose a framework that automates the evaluation of patients' eligibility as candidates for a relevant clinical trial. As a first step, the framework splits patient clinical reports and clinical trial sentences into comparatively basic phrase units. Secondly, it classifies the phrases into clinical categories (diagnosis, drug, procedure, observation). Thirdly, it converts candidate phrases into vector representations using an appropriate deep learning-based NLP model. Finally, it calculates a semantic matching score between patients and a clinical trial using a combination of cosine similarity and a scaling normalization method.

This paper is organized as follows: In Section II, we expose the problem definition and review related work. In Section III, we describe our framework and illustrate the different challenges. The evaluation of the results is discussed in Section IV. Finally, we conclude in Section V.

II. BACKGROUND

A.
Problem definition

According to our approach, the problem of patient-trial matching can be described as follows. Finding clinical trial participants is the task of matching a patient P_i (P_i ∈ EMR), represented by a discharge summary DS_i, to a clinical trial CT, represented by its eligibility criteria EC. Formally, the solution to this task is to find the top-K highest values of a function M that computes the matching score, denoted by:

    M(P_i, CT) = v

where v represents the score of matching patient P_i to CT. The list of the top-K highest scores reduces the overall number of patients that need to be screened by clinicians in order to identify eligible patients.
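As a minimal illustration of this formulation, the Python sketch below ranks patients with an arbitrary scoring function and keeps the K highest; `matching_score` is a placeholder for the function M that Section III constructs.

```python
# Minimal sketch of the top-K selection; `matching_score` is a
# placeholder for the function M developed in Section III.
from typing import Callable, List, Tuple

def top_k_candidates(patients: List[str], trial: str,
                     matching_score: Callable[[str, str], float],
                     k: int = 10) -> List[Tuple[str, float]]:
    """Rank patients by M(P_i, CT) and keep the K highest scores."""
    scored = [(p, matching_score(p, trial)) for p in patients]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```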
B. Data representation

1) Clinical trial: A clinical trial is a type of research that provides a long-standing foundation for the practice of medicine and the evaluation of new medical treatments. Each trial has eligibility criteria describing the characteristics required of participants: a patient must meet all inclusion criteria and none of the exclusion criteria. These criteria differ from study to study. Authors in [7] analysed 1000 eligibility criteria and showed that 23% of the criteria are simple, or can be reduced to simple criteria, while 77% remain complex to evaluate. Therefore, a formally computable representation of eligibility criteria requires natural language processing techniques as part of automated screening for patient eligibility.

2) Patient medical records: An EMR typically collects various types of patient information, including discharge summaries, prior diagnoses, radiology reports, medication history, and so on. Hospital discharge summaries are a physician-authored synopsis of a patient's hospital stay, which serves as the main document communicating a patient's care plan to the post-hospital care team [8]. Discharge summaries are organized in several sections, usually including past medical history and history of present illness, as shown in Fig. 1.

Fig. 1. An example of discharge summary contents and format.

C. Related work

In the recent past, several projects have developed tools and technologies for automated trial-patient matching. Milian et al. [9] used a template-based formalism to extract and represent the semantics of trial criteria in order to improve their comparability. Patel et al. [10] formulated the matching process as a semantic retrieval problem by expressing clinical trial criteria as semantic queries, which a reasoner can then use with a formal medical ontology (SNOMED CT) to retrieve eligible patients. Other works, such as EliIE [11] and Criteria2Query [12], have focused on identifying standardized medical entities in eligibility criteria using machine learning approaches, the extracted entities then being used to query patient data. Shivade et al. [13] constructed an annotated dataset indicating whether a medical note contains text that meets a criterion or not; they then implemented two lexical and two semantic methods to determine a relevance score between each sentence and a criterion statement, and found that the semantic methods gave better results than the lexical ones. Ni et al. [14] evaluated a system using a combination of NLP, information retrieval, and machine learning methods to identify a cohort of patients for clinical trial eligibility pre-screening. Their system relies on both structured data and clinical notes from EMRs.
III. FRAMEWORK OVERVIEW

In this section, we describe the framework we propose for automating the matching process between patients and a clinical trial. The framework takes into account the following challenges: (i) in order to treat complex sentences in patient data as well as in clinical trials, we break paragraphs down into sentences, and complex sentences are then parsed into phrases, which are the basic units for matching; (ii) to avoid costly comparisons without faulty dismissals, phrases are partitioned using classification methods, which limits the number of pairs to match; (iii) to match phrases, we represent them as distributed vectors, which enables calculating similarity for formally different but semantically related phrases. Fig. 2 shows an overview of our patients-to-clinical-trial matching framework. Given a clinical trial CT and a set of patients P, our task is to calculate a matching score M(P_i, CT).

Fig. 2. Framework overview: phrase segmentation, phrase classification, phrase embedding with fine-tuned BioBERT, maximum cosine similarity, and ranking and scoring (illustrated in the figure with an example maximum-similarity matrix for five patients and four criteria, and the corresponding normalized scores).

A. Paragraph and sentence decomposition

In order to measure the similarity between two sentences, we have to deal with simple sentences, each representing a linguistically meaningful unit. This requires segmenting both paragraph-level and sentence-level structures into phrase-level structures. According to [15], segmentation of paragraphs and sentences is the process of parsing longer processing units, consisting of one or more words, for further processing stages such as part-of-speech parsers, morphological analyzers, etc.

In our model, we handle each phrase as a primitive semantic unit and find matching phrases between patient and clinical trial by calculating the similarity of each phrase in the discharge summary to each phrase in the eligibility criteria (EC). We used the paragraph and sentence segmentation of MetaMap [16]. MetaMap is provided by the National Library of Medicine (NLM) to map biomedical text to UMLS Metathesaurus concepts [17]; it breaks text into paragraphs, then sentences, and then phrases. Table I presents a simple example of segmenting sentences into phrases: the first row refers to the eligibility criteria of trial NCT03484780, and the second illustrates an example from a patient discharge summary.

TABLE I
EXAMPLE OF SENTENCE SEGMENTATION INTO PHRASES

Paragraph | Phrases
Eligibility criteria (NCT03484780): "Previous open laparotomy or contraindications to laparoscopy, as determined by implanting physician." | 1. Previous open laparotomy; 2. contraindications to laparoscopy; 3. determined by implanting physician
Discharge summary: "History of paroxysmal atrial fibrillation with anticoagulation in the past. History of coronary artery disease status post myocardial infarction." | 1. History of paroxysmal atrial fibrillation; 2. with anticoagulation in the past; 3. History of coronary artery disease; 4. status post myocardial infarction
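For illustration only, the following sketch approximates this decomposition with regular expressions. It is a crude stand-in for MetaMap, not the tool the framework actually uses, and the splitting rules are our own assumptions tuned to the Table I example.

```python
# Rough, regex-based stand-in for MetaMap's segmentation, shown only to
# illustrate the paragraph -> sentence -> phrase decomposition of Table I.
import re

def segment(paragraph: str) -> list:
    """Split a paragraph into sentences, then into candidate phrases."""
    sentences = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    phrases = []
    for sentence in sentences:
        # Commas and coordinating conjunctions as crude phrase boundaries.
        parts = re.split(r',\s*|\s+(?:or|and)\s+', sentence)
        phrases.extend(p.strip(' .') for p in parts if p.strip(' .'))
    return phrases

print(segment("Previous open laparotomy or contraindications to "
              "laparoscopy, as determined by implanting physician."))
# ['Previous open laparotomy', 'contraindications to laparoscopy',
#  'as determined by implanting physician']
```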
B. Phrases classification

A discharge summary report contains information about many different topics. Therefore, the large number of heterogeneous phrases extracted from patient reports may affect the efficiency and effectiveness of pairwise phrase matching [18]. To minimize the number of required comparisons, we applied a filtering methodology that excludes all phrases that do not belong to a given class, which limits the number of pairs to match.

Data classification techniques support this filtering by separating the phrases extracted from patient data and clinical trials into different medical categories. This classification filters out non-matching pairs prior to verification, which increases the efficiency of phrase similarity matching with high precision and without sacrificing recall.

In our study, a total of 1500 eligibility criteria were extracted from a clinical trials database (https://clinicaltrials.gov/) and manually labelled by a certified nurse and a data science master's student according to four classes (diagnosis, drug, procedure, observation). We empirically explored and compared four methods widely used in classification as our baselines: SVM, CNN, LSTM, and C-LSTM [19]. For the SVM and CNN models, word embeddings were initialized with PubMed-and-PMC-w2v [20], using the average of the word embeddings over all words in the sentence.

Our experiments indicate that the CNN + w2v model has the best prediction performance among the models explored, with a precision of 0.87, a recall of 0.88, and an F1-score of 0.875. We therefore adopted CNN + PubMed-and-PMC-w2v to perform this classification task and were able to categorize the phrases into the four aforementioned categories.
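As a hedged sketch of such a classifier, the Keras model below follows the common Conv1D-over-embeddings pattern. The layer sizes are illustrative assumptions (the paper does not report the exact architecture), and in practice the Embedding layer would be initialized with the PubMed-and-PMC-w2v vectors rather than trained from scratch.

```python
# Illustrative Keras sketch of the CNN phrase classifier; hyperparameters
# are assumptions, and in the paper the Embedding weights would come from
# PubMed-and-PMC-w2v instead of random initialization.
from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAX_LEN, NUM_CLASSES = 20000, 200, 30, 4

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Conv1D(128, kernel_size=3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    # Four outputs: diagnosis, drug, procedure, observation.
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```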
C. Phrase vector representations

The purpose of this work is to allow the matching of patient data and clinical trials by comparing unstructured data from both. Our claim is that by measuring the similarity between the primitive semantic medical units (medical phrases) of a patient's discharge summary and of the eligibility criteria, we can generate a score supporting the matching task.

There are plenty of measures of semantic similarity between sentences in NLP. Both unsupervised and supervised methods have been used to calculate the semantic similarity between two sentences in the biomedical domain [21]. Recently, a number of novel approaches have addressed this problem by producing sentence vectors [22]; for example, neural sentence-embedding methods [23] have been shown to outperform traditional approaches such as TF-IDF and word-overlap-based measures.

1) Universal sentence embeddings: The concept of universal sentence embeddings has grown in popularity as it leverages models trained on large text corpora. These pre-trained models can be used in a wide range of downstream tasks, for instance by providing versatile sentence-embedding models that convert sentences into vector representations. Notable works include ELMo [24], GPT [25], and BERT [26].

2) BioBERT: BERT (Bidirectional Encoder Representations from Transformers) is a neural network language model trained on plain text for masked word prediction and next sentence prediction. BERT applies a multi-layer bidirectional transformer encoder with self-attention. According to [27], BERT achieved state-of-the-art performance on many natural language processing tasks and was significantly better than other models. Compared against more recent models, XLNet [28] outperforms BERT and achieves better prediction metrics on the GLUE benchmark [29], but it is not yet widely used in the medical field. Applying the same architecture as BERT, Lee et al. [30] proposed the BioBERT language model, trained on biomedical corpora including PubMed and PMC. The BioBERT model has shown promising results in the biomedical domain.

3) Phrase embedding: To generate context-rich phrase embeddings, we chose BioBERT as the language model, in conjunction with the bert-as-service library [31]. Bert-as-service is a feature extraction service based on BERT which uses one of two strategies to derive a fixed-size vector: the default strategy average-pools all token representations of the second-to-last hidden layer, while the second uses the output of the special CLS token and is recommended only after fine-tuning BERT on a downstream task.
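A minimal sketch of this extraction step, assuming a bert-as-service server has already been started over a BioBERT checkpoint (the model path below is a placeholder):

```python
# Phrase embedding via bert-as-service; assumes a server was started over
# a BioBERT checkpoint, e.g.:
#   bert-serving-start -model_dir /path/to/biobert -num_worker=1
# By default the service average-pools the second-to-last hidden layer.
from bert_serving.client import BertClient

bc = BertClient()  # connects to a local server by default
phrases = ["History of CVA",
           "patient has history of stroke",
           "patient has normal brain MRI"]
vecs = bc.encode(phrases)
print(vecs.shape)  # (3, 768) for a BERT-base-sized model
```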
D. Phrases Similarity Measures

The similarity between two vectors can be evaluated using various measures such as cosine similarity, Euclidean distance, and Manhattan distance. Since these metrics operate in a linear space in which all dimensions are weighted equally, we perform the similarity matching of different phrases by ranking them according to cosine similarity. The rank of similarity is obtained from equations (1) and (2):

    cos(x, y) = (x · y) / (||x|| ||y||)    (1)

    if cos(A, B) > cos(A, C), then A is more similar to B than to C.    (2)

Whereas pre-trained BioBERT knowledge often performs well on certain tasks, as we shall see, this prior knowledge is not sufficient to compute the similarity of sentences from their embeddings. Indeed, we first tried computing the cosine similarity of expert-annotated sentences using embeddings extracted from pre-trained BioBERT, without any fine-tuning. The results were unsatisfactory and unacceptable (Table II): the most similar sentence was sometimes the exact opposite. For example, the sentence most similar to "History of CVA" was "patient has normal brain MRI", with a similarity of 0.91, even though experts annotated the pair as a contradiction, while the entailed sentence "patient has history of stroke" came second with a similarity of 0.89. These experiments reinforced our belief that BioBERT must be fine-tuned on our downstream task.

1) Supervised Fine-tuning: Transfer learning is the process of extending a pre-trained model by leveraging data from an additional domain for better model generalization [32]. The most common transfer learning technique in NLP is fine-tuning, which copies the weights of a pre-trained network and tunes them using labeled data from downstream tasks. BERT is a fine-tuning-based representation model that achieves state-of-the-art performance on a large suite of sentence-level tasks with pre-trained representations, reducing the need for many heavily engineered task-specific architectures.

In the context of natural language understanding (NLU), comparing the relationship between two sentences is addressed by downstream tasks such as Natural Language Inference (NLI) and Semantic Textual Similarity (STS) [29]. Moreover, authors in [33] have shown that fine-tuning BERT on NLI and STS datasets yields sentence embeddings that improve by 11.7 points over InferSent [34] and by 5.5 points over the Universal Sentence Encoder [22]. In this context, we first fine-tuned BioBERT on the STS-B dataset, producing our BioBERT-based model, and then further fine-tuned it on the MedNLI dataset, using the fine-tuning classifier from the BERT codebase [35].

• MedNLI [36]: a large, publicly available, expert-annotated dataset drawn from the medical history sections of MIMIC-III. MedNLI includes 14,049 clinical sentence pairs, each annotated with one of three classes: entailment, contradiction, or neutral.
• STS-B [37]: a collection of sentence pairs selected from sources such as news headlines. The dataset consists of 8,628 paired sentences labelled by humans with a similarity score from 0 to 5 denoting how similar the two sentences are in semantic meaning.

2) Evaluation of fine-tuned BioBERT: We evaluated the new BioBERT model by computing the cosine similarity between phrase embeddings. We observed that the model was not only able to rank phrases correctly in terms of similarity, but also gave more appropriate cosine values. A representative sample of the results is shown in Table II.

TABLE II
NLI AND COSINE SIMILARITY BEFORE AND AFTER FINE-TUNING OF BIOBERT

Phrase 1 (P1) | Phrase 2 (P2) | Experts NLI(P1, P2) | Pre-trained BioBERT Cos / Rank | Fine-tuned BioBERT Cos / Rank
History of CVA | patient has history of stroke | Entailment | 0.89 / 1.53 | 0.87 / 3.00
 | patient has normal brain mri | Contradiction | 0.91 / 3.00 | 0.75 / 0.00
 | patient is hemiplegic | Neutral | 0.86 / 0.00 | 0.77 / 0.38
Per report ECG with initial qtc of 410 now 475, QRS 82 initially, now 86, rate = 95. | Patient has abnormal EKG findings. | Entailment | 0.89 / 2.05 | 0.82 / 3.00
 | Patient has normal EKG. | Contradiction | 0.90 / 3.00 | 0.80 / 2.30
 | Patient has angina. | Neutral | 0.88 / 0.00 | 0.73 / 0.00
History of hypercholesterolemia and peptic ulcer disease s/p gastric bypass; some years ago was involved in a low-speed MVC. | the patient was in a MVC. | Entailment | 0.89 / 3.00 | 0.82 / 3.00
 | the patient has no medical history. | Contradiction | 0.88 / 2.48 | 0.53 / 0.00
 | the patient has no significant injuries. | Neutral | 0.86 / 0.00 | 0.67 / 1.51
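A hedged sketch of the first fine-tuning stage (STS-B-style similarity regression) using the HuggingFace transformers library: the paper used the original BERT fine-tuning scripts [35], so the checkpoint name, gold score, and training step here are illustrative assumptions rather than the authors' exact setup.

```python
# Hedged sketch of fine-tuning BioBERT for STS-B-style similarity
# regression; the checkpoint name below is an assumed public BioBERT
# model, not necessarily the one the authors used.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "dmis-lab/biobert-v1.1"  # assumption: a public BioBERT checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=1)  # a single label makes the head a regressor (MSE)

batch = tok(["History of CVA"], ["patient has history of stroke"],
            padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([4.5])  # illustrative gold similarity on the STS scale
loss = model(**batch, labels=labels).loss
loss.backward()  # an optimizer step over the full dataset would follow
```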
E. Matching Patients to Clinical Trials

After fine-tuning the BioBERT model for optimized cosine similarity and creating phrase embeddings for both the discharge summaries and the clinical trial, we proceed to find clinical trial participants in an EMR dataset. Formally, we denote:

• DS_i = {ph_{i,1}, ph_{i,2}, ..., ph_{i,r}}: the phrases extracted from the discharge summary of patient P_i.
• IEC = {iec_1, iec_2, ..., iec_p}: the phrases extracted from the inclusion eligibility criteria.
• EEC = {eec_1, eec_2, ..., eec_q}: the phrases extracted from the exclusion eligibility criteria.
• EC = {ec_1, ec_2, ..., ec_l} = IEC ∪ EEC, with l = p + q: all phrases extracted from the eligibility criteria.
• S ∈ [0, 1]^{n×l}: the cosine similarity matrix, where n and l are the numbers of patients and EC elements, respectively.

1) Matching Patient to Eligibility Criteria: Once phrase embeddings are computed for the patients and the clinical trial eligibility criteria, we calculate the similarity between phrases of the same class (diagnosis, drug, procedure, ...) as defined in Subsection III-B. An element s_{i,j} of S represents the similarity between patient P_i and the single eligibility criterion ec_j. The similarity function computes the cosine between ec_j and each phrase ph_{i,r} extracted from DS_i; only the highest cosine value is retained for s_{i,j}, and all other values are discarded:

    s_{i,j} = max_{ph_{i,r} ∈ DS_i} cos(ph_{i,r}, ec_j),  i ∈ [1, n], j ∈ [1, l]    (3)

Once the similarity values are obtained, S takes the form:

    S = [ max_{ph_{1,r}} cos(ph_{1,r}, ec_1)   ...   max_{ph_{1,r}} cos(ph_{1,r}, ec_l) ]
        [               ...                   ...                  ...                 ]
        [ max_{ph_{n,r}} cos(ph_{n,r}, ec_1)   ...   max_{ph_{n,r}} cos(ph_{n,r}, ec_l) ]

2) Ranking and Scoring Patients: The semantic cosine similarity calculated above yields a proportional similarity instead of an exact textual match. However, when comparing the similarity values obtained for different features (eligibility criteria) in the matrix S, a higher value does not necessarily mean greater similarity to the patient. For example, if s_{x,1} and s_{y,2} are the highest values of features ec_1 and ec_2 respectively, and s_{x,1} > s_{y,2}, this does not mean that P_x has a phrase more similar to ec_1 than P_y has to ec_2 (as noted in equation (2)); it only means that P_x and P_y are ranked at the top of the similarity lists for ec_1 and ec_2, respectively. The same logic applies to the lowest value, which represents the last rank of similarity.

This variation in similarity values across features requires a range normalization step that turns cosine similarity into a rank-like similarity, which directly supports the computation of a matching score between patients and the clinical trial. To this end, we generate a new matrix R by applying the following feature-scaling normalization:

    r_{i,j} = n × (s_{i,j} − min_∀i(s_{i,j})) / (max_∀i(s_{i,j}) − min_∀i(s_{i,j}))      if ec_j ∈ IEC
    r_{i,j} = (−n) × (s_{i,j} − min_∀i(s_{i,j})) / (max_∀i(s_{i,j}) − min_∀i(s_{i,j}))   if ec_j ∈ EEC    (4)

Finally, the matching score M of patient P_i with a clinical trial is determined by:

    M(P_i, CT) = Σ_{j=1}^{l} r_{i,j}    (5)
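Equations (3) to (5) translate directly into a few lines of NumPy. The sketch below assumes phrase vectors are already grouped per patient; the guard against a zero column span is our own addition, not discussed in the paper.

```python
# NumPy transcription of equations (3)-(5); a sketch assuming each
# patient's phrases and each criterion are already embedded as vectors.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_matrix(patient_phrases, criteria_vecs):
    """Eq. (3): s_ij is the best cosine between P_i's phrases and ec_j."""
    return np.array([[max(cosine(ph, ec) for ph in phrases)
                      for ec in criteria_vecs]
                     for phrases in patient_phrases])

def match_scores(S, is_inclusion):
    """Eqs. (4)-(5): scale each criterion column to [0, n], flip the sign
    for exclusion criteria, and sum over criteria. The zero-span guard is
    our own addition."""
    n = S.shape[0]
    span = S.max(axis=0) - S.min(axis=0)
    R = n * (S - S.min(axis=0)) / np.where(span == 0, 1, span)
    R[:, ~np.asarray(is_inclusion)] *= -1
    return R.sum(axis=1)  # M(P_i, CT) for every patient
```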
IV. EVALUATION

To validate our framework, we used two datasets: MIMIC-III (Medical Information Mart for Intensive Care) [38], comprising information on patients admitted to critical care units, and ClinicalTrials.gov (https://clinicaltrials.gov/), a web-based resource providing access to information on supported clinical studies.

A. Text processing

The MIMIC-III clinical dataset is a critical care database that contains 2,083,108 medical reports from 46,520 patients. We experimented with a randomly selected set of 100 discharge summaries from patients' last visits, excluding patients under 18. The segmentation stage produced an average of 400 phrases per report. We selected a clinical trial that investigates the role of an aldosterone antagonist in patients with heart failure with preserved ejection fraction (NCT04078425). Fig. 3 shows the eligibility criteria of this clinical trial.

Fig. 3. The eligibility criteria specified in the NCT04078425 clinical trial.

B. Evaluation of the obtained results

Table III presents the results for a sample of ten patients. In order to evaluate the clinical correctness of matching patients to the clinical trial (NCT04078425), a validation task was performed manually by a nurse and a computer science student. Notably, the evaluation of the matching did not reveal false positives in the score results. Indeed, the similarity scores reflect the order of matching between patients and the clinical trial. The score distribution ranged from -15 to 8, and the eligible patients retained for further screening by experts were those with a score greater than 5.

TABLE III
RANKS AND SCORES OF MATCHING 10 PATIENTS WITH 6 ELIGIBILITY CRITERIA (NCT04078425)

        iec1    iec2    eec1     eec2     eec3     eec4     Score
P-1     9.46    9.84    -8.47    -2.74    -1.02    -1.02     6.03
P-2     5.02    7.44    -3.43    -3.12    -8.42    -8.42   -10.93
P-3     9.08    8.65   -10.00    -2.38   -10.00   -10.00   -14.65
P-4     0.00    4.09    -2.96    -5.76    -5.24    -5.24   -15.12
P-5     3.43    4.02    -6.26    -2.69    -1.09    -1.09    -3.69
P-6     5.19    0.00     0.00    -1.42     0.00     0.00     3.77
P-7     5.65    2.95    -3.86    -2.72    -0.15    -0.15     1.72
P-8     7.26   10.00    -7.52    -5.76    -6.98    -6.98    -9.99
P-9     6.43    9.14    -4.44   -10.00    -2.70    -2.70    -4.27
P-10   10.00    7.44    -6.27     0.00   -10.00   -10.00    -8.83

We should note that the scores would be more realistic if the segmentation process were more accurate. For instance, the sentence "you were thought to have a blood clot in your right leg" was segmented by MetaMap into "a blood clot in your right leg", which could lead to a false outcome.

V. CONCLUSION

EMRs contain a large portion of unstructured data that needs to be matched with eligibility criteria for trial-patient enrollment. Indeed, the gradual improvement of artificial intelligence technology could reduce the number of physician-hours spent screening patients for eligibility. To tackle this problem, we proposed a framework designed to automatically recommend the most suitable patients for a clinical trial. The framework adopts a pre-trained language model (BioBERT) and uses the STS-B and MedNLI datasets to improve the accuracy of the model via transfer learning. This work verified that fine-tuned BioBERT performs better at calculating the similarity between two medical sentences using embedding-based metrics. In future work, we will also explore the structured tables of EMRs in order to further improve the performance and accuracy of our trial-patient matching framework.

ACKNOWLEDGMENT

The authors would like to thank Marvin Moughabghab for his efforts and contributions to this work.

REFERENCES

[1] H. Dhayne, R. Haque, R. Kilany, and Y. Taher, "In search of big medical data integration solutions - a comprehensive survey," IEEE Access, vol. 7, pp. 91265-91290, 2019.
[2] G. De Moor, M. Sundgren, D. Kalra, A. Schmidt, M. Dugas, B. Claerhout, T. Karakoyun, C. Ohmann, P.-Y. Lastic, N. Ammour et al., "Using electronic health records for clinical research: the case of the EHR4CR project," Journal of Biomedical Informatics, vol. 53, pp. 162-173, 2015.
[3] M. Dugas, M. Lange, C. Müller-Tidow, P. Kirchhof, and H.-U. Prokosch, "Routine data from hospital information systems can support patient recruitment for clinical studies," Clinical Trials, vol. 7, no. 2, pp. 183-189, 2010.
[4] S. T. Rosenbloom, J. C. Denny, H. Xu, N. Lorenzi, W. W. Stead, and K. B. Johnson, "Data from clinical notes: a perspective on the tension between structure and flexible documentation," Journal of the American Medical Informatics Association, vol. 18, no. 2, pp. 181-186, 2011.
[5] H. Dhayne, R. Kilany, R. Haque, and Y. Taher, "SeDIE: A semantic-driven engine for integration of healthcare data," in 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2018, pp. 617-622.
[6] P. Raghavan, J. L. Chen, E. Fosler-Lussier, and A. M. Lai, "How essential are unstructured clinical narratives and information fusion to clinical trial recruitment?" AMIA Summits on Translational Science Proceedings, vol. 2014, p. 218, 2014.
[7] S. W. Tu, M. Peleg, S. Carini, M. Bobak, J. Ross, D. Rubin, and I. Sim, "A practical method for transforming free-text eligibility criteria into computable criteria," Journal of Biomedical Informatics, vol. 44, no. 2, pp. 239-250, 2011.
[8] S. Kripalani, F. LeFevre, C. O. Phillips, M. V. Williams, P. Basaviah, and D. W. Baker, "Deficits in communication and information transfer between hospital-based and primary care physicians: implications for patient safety and continuity of care," JAMA, vol. 297, no. 8, pp. 831-841, 2007.
[9] K. Milian, R. Hoekstra, A. Bucur, A. ten Teije, F. van Harmelen, and J. Paulissen, "Enhancing reuse of structured eligibility criteria and supporting their relaxation," Journal of Biomedical Informatics, vol. 56, pp. 205-219, 2015.
[10] C. Patel, J. Cimino, J. Dolby, A. Fokoue, A. Kalyanpur, A. Kershenbaum, L. Ma, E. Schonberg, and K. Srinivas, "Matching patient records to clinical trials using ontologies," in The Semantic Web. Springer, 2007, pp. 816-829.
[11] T. Kang, S. Zhang, Y. Tang, G. W. Hruby, A. Rusanov, N. Elhadad, and C. Weng, "EliIE: An open-source information extraction system for clinical trial eligibility criteria," Journal of the American Medical Informatics Association, vol. 24, no. 6, pp. 1062-1071, 2017.
[12] C. Yuan, P. B. Ryan, C. Ta, Y. Guo, Z. Li, J. Hardin, R. Makadia, P. Jin, N. Shang, T. Kang et al., "Criteria2Query: a natural language interface to clinical databases for cohort definition," Journal of the American Medical Informatics Association, vol. 26, no. 4, pp. 294-305, 2019.
[13] C. Shivade, C. Hebert, M. Lopetegui, M.-C. De Marneffe, E. Fosler-Lussier, and A. M. Lai, "Textual inference for eligibility criteria resolution in clinical trials," Journal of Biomedical Informatics, vol. 58, pp. S211-S218, 2015.
[14] Y. Ni, S. Kennebeck, J. W. Dexheimer, C. M. McAneney, H. Tang, T. Lingren, Q. Li, H. Zhai, and I. Solti, "Automated clinical trial eligibility prescreening: increasing the efficiency of patient identification for clinical trials in the emergency department," Journal of the American Medical Informatics Association, vol. 22, no. 1, pp. 166-178, 2014.
[15] D. D. Palmer, "Tokenisation and sentence segmentation," Handbook of Natural Language Processing, pp. 11-35, 2000.
[16] A. R. Aronson, "Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program," in Proceedings of the AMIA Symposium. American Medical Informatics Association, 2001, p. 17.
[17] A. R. Aronson and F.-M. Lang, "An overview of MetaMap: historical perspective and recent advances," Journal of the American Medical Informatics Association, vol. 17, no. 3, pp. 229-236, 2010.
[18] G. Papadakis, E. Ioannou, T. Palpanas, C. Niederee, and W. Nejdl, "A blocking framework for entity resolution in highly heterogeneous information spaces," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 12, pp. 2665-2682, 2012.
[19] C. Zhou, C. Sun, Z. Liu, and F. Lau, "A C-LSTM neural network for text classification," arXiv preprint arXiv:1511.08630, 2015.
[20] S. Moen and T. S. S. Ananiadou, "Distributional semantics resources for biomedical text processing."
[21] G. Soğancıoğlu, H. Öztürk, and A. Özgür, "BIOSSES: a semantic sentence similarity estimation system for the biomedical domain," Bioinformatics, vol. 33, no. 14, pp. i49-i58, 2017.
[22] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar et al., "Universal sentence encoder," arXiv preprint arXiv:1803.11175, 2018.
[23] Q. Chen, Y. Peng, and Z. Lu, "BioSentVec: creating sentence embeddings for biomedical texts," arXiv preprint arXiv:1810.09302, 2018.
[24] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep contextualized word representations," arXiv preprint arXiv:1802.05365, 2018.
[25] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," OpenAI, 2018.
[26] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[27] A. Talman and S. Chatzikyriakidis, "Testing the generalization power of neural network models across NLI benchmarks," in Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2019, pp. 85-94.
[28] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized autoregressive pretraining for language understanding," arXiv preprint arXiv:1906.08237, 2019.
[29] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "GLUE: A multi-task benchmark and analysis platform for natural language understanding," arXiv preprint arXiv:1804.07461, 2018.
[30] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, "BioBERT: a pre-trained biomedical language representation model for biomedical text mining," arXiv preprint arXiv:1901.08746, 2019.
[31] H. Xiao, "bert-as-service," https://github.com/hanxiao/bert-as-service, 2018.
[32] S. Ruder, M. E. Peters, S. Swayamdipta, and T. Wolf, "Transfer learning in natural language processing," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, 2019, pp. 15-18.
[33] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using siamese BERT-networks," arXiv preprint arXiv:1908.10084, 2019.
[34] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, "Supervised learning of universal sentence representations from natural language inference data," arXiv preprint arXiv:1705.02364, 2017.
[35] "google-research/bert: TensorFlow code and pre-trained models for BERT," https://github.com/google-research/bert, accessed 09/17/2019.
[36] A. Romanov and C. Shivade, "Lessons from natural language inference in the clinical domain," arXiv preprint arXiv:1808.06752, 2018.
[37] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia, "SemEval-2017 task 1: Semantic textual similarity - multilingual and cross-lingual focused evaluation," arXiv preprint arXiv:1708.00055, 2017.
[38] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark, "MIMIC-III, a freely accessible critical care database," Scientific Data, vol. 3, p. 160035, 2016.