=Paper=
{{Paper
|id=Vol-2622/paper4
|storemode=property
|title=Using Embedding-based Metrics to Expedite Patients Recruitment Process for Clinical Trials
|pdfUrl=https://ceur-ws.org/Vol-2622/paper4.pdf
|volume=Vol-2622
|authors=Houssein Dhayne,Rima Kilany
|dblpUrl=https://dblp.org/rec/conf/bdcsintell/DhayneK19
}}
==Using Embedding-based Metrics to Expedite Patients Recruitment Process for Clinical Trials==
Houssein Dhayne
Faculty of Engineering, ESIB, Saint Joseph University, Beirut, Lebanon
houssein.dhayne@net.usj.edu.lb

Rima Kilany
Faculty of Engineering, ESIB, Saint Joseph University, Beirut, Lebanon
rima.kilany@usj.edu.lb
Abstract—Despite the unprecedented volumes of Electronic Medical Records (EMRs) generated daily across healthcare facilities, the ability to leverage these data for patient participation in clinical trials remains overwhelmingly unfulfilled. The reason behind this is that matching patient information to the eligibility criteria of clinical trials is a manual, effort-consuming process. Automating this process is therefore an essential step towards increasing the number of patients participating in clinical research. To address this issue, we propose a novel framework for automated patient-to-clinical-trial matching. The matching process is based on measuring the similarity score between phrases extracted from patient medical records and the eligibility criteria of a trial.
Our solution is based on a combination of NLP techniques and modern deep learning-based NLP models. In this context, we follow pre-training and transfer learning approaches to help the model learn task-specific reasoning skills. Additionally, we perform supervised fine-tuning on the large Medical Natural Language Inference (MedNLI) and Semantic Textual Similarity (STS-B) datasets. The matching is performed at the semantic phrase level by converting patient information and trial criteria into vector representations. We then use a scoring function that combines cosine similarity and scaling normalization to identify potential patient-trial matches. The experimental results show that our framework is highly effective in ranking patients by their similarity scores.

Index Terms—NLP, NLI, EMR, Automated clinical trial eligibility screening, BioBERT, Sentence similarity

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

I. INTRODUCTION

The widespread adoption and use of electronic medical records (EMRs), together with the development of advanced artificial intelligence models, offer remarkable opportunities for improving the clinical research sector [1]. Furthermore, EMRs offer a wide range of potential uses in clinical trials, such as facilitating the clinical trial feasibility assessment and patient recruitment, as well as obtaining main patient health information and medical history prior to the screening visit. The latter is a critical step in reducing the costs and duration of clinical trials [2]. Additionally, linking EMRs with clinical trials has been shown to increase patient recruitment rates [3]. However, there are many barriers to overcome in order to use EMRs for clinical trials.

Even though EMRs were designed to record information in a structured format, such as procedure information, diagnosis codes, drug prescriptions, and lab results, free text remains the most flexible way for physicians to express case nuances and clinical reasoning [4]. These free texts usually contain important facts about patients, but they are rarely available for formal queries [5].

On the other hand, the eligibility criteria of a clinical trial describe the characteristics of patients who are qualified to participate in the trial. Each criterion is usually expressed as descriptive text and specified in the form of inclusion and exclusion criteria. Therefore, free-text criteria cannot always be transformed into structured data representations.

The authors in [6] confirmed that using only structured data from the EMR is insufficient for resolving eligibility criteria for patient recruitment in clinical trials, and that unstructured data is essential to resolve 59% to 77% of the trial criteria. However, matching clinical notes with eligibility criteria is still a manually performed task, which makes it an expensive process in terms of time and effort. This slows down clinical trials and may delay new drugs from benefiting patients. As a consequence, it might entail the loss of human lives that otherwise would have been able to benefit from new medication. For these reasons, automated matching of clinical notes with eligibility criteria in the eligibility screening workflow would help overcome the bottlenecks of pre-screening practices in a trial setting.

To tackle the above challenge efficiently, we need to execute the matching process at a semantic sentence level, rather than by just checking for the presence or absence of a lexical criterion. Our investigation of the potential use of modern deep learning-based NLP (Natural Language Processing) models led us to propose a framework that automates the evaluation of patients' eligibility as candidates for a relevant clinical trial. As a first step, the framework splits patient clinical reports and clinical trial sentences into comparatively basic phrase units. Secondly, it classifies the phrases into various clinical categories (diagnosis, drug, procedure, observation). Thirdly, the framework converts candidate phrases into vector representations using an appropriate deep learning-based NLP model. Finally, it calculates a semantic matching score between patients and a clinical trial using a combination of cosine similarity and a scaling normalization method.

This paper is organized as follows: In Section II, we expose the problem definition and review the related works. In Section III, we describe our framework and illustrate its different challenges. The evaluation of the results and outcomes is discussed in Section IV. Finally, we conclude this paper in Section V.

II. BACKGROUND

A. Problem definition

According to our approach, the problem of patient-trial matching can be described as follows: finding clinical trial participants is the task of matching a patient Pi (Pi ∈ EMR), represented by a Discharge Summary DSi, to a Clinical Trial CT, represented by its Eligibility Criteria EC. Formally, the solution to this task is to find the top-K highest values of the function M which computes the matching score, denoted by M(Pi, CT) = v, where v represents the score of matching patient Pi to CT.

This list of the top-K highest scores reduces the overall number of patients that need to be screened by clinicians in order to identify eligible patients.

B. Data representation

1) Clinical trial: A clinical trial is a type of research that provides a longstanding foundation in the practice of medicine and the evaluation of new medical treatments. Each trial has eligibility criteria describing the characteristics required of participants: a patient must meet all inclusion criteria and none of the exclusion criteria. In this respect, the criteria differ from study to study. The authors in [7] analysed 1000 eligibility criteria and showed that 23% of the criteria are simple, or can be reduced to simple criteria, while 77% remain complex to evaluate. Therefore, a formally computable representation of eligibility criteria would require natural language processing techniques as part of automated screening for patient eligibility.

2) Patient medical records: An EMR typically collects various types of patient information, including patient discharge summaries, prior diagnoses, radiology reports, medication history, and so on. Hospital discharge summaries are a physician-authored synopsis of a patient's hospital stay, which serve as the main documents communicating a patient's care plan to the post-hospital care team [8]. Discharge summaries are organized in several sections, which usually include the past medical history and the history of present illness, as shown in Fig. 1.

Fig. 1. An example of discharge summary contents and format.

C. Related work

In the recent past, several projects have developed tools and technologies for automated trial-patient matching. Milian et al. [9] used a template-based formalism to extract and represent the semantics of trial criteria in order to improve their comparability. Patel et al. [10] formulated the matching process as a semantic retrieval problem by expressing clinical trial criteria in the form of semantic queries, which a reasoner can then use with a formal medical ontology (SNOMED CT) to retrieve eligible patients. Other works such as EliIE [11] and Criteria2Query [12] have focused on identifying standardized medical entities in eligibility criteria using machine learning approaches, the extracted entities then being used to query patient data. Shivade et al. [13] constructed an annotated dataset that records whether a medical note contains text that meets a criterion or not. They then implemented two lexical methods and two semantic methods to determine a relevance score of each sentence with respect to a criterion statement, and found that the semantic methods gave better results than the lexical ones. Ni et al. [14] evaluated a system using a combination of NLP, information retrieval, and machine learning methods to identify a cohort of patients for clinical trial eligibility pre-screening. Their system relies on both structured data and clinical notes from EMRs.

III. FRAMEWORK OVERVIEW

In this section, we describe the framework we propose for automating the matching process between patients and a clinical trial. This framework takes into account the following challenges: (i) in order to treat complex sentences in patient data as well as in clinical trials, we break down paragraphs into sentences, and complex sentences are then parsed into phrases; these phrases are the basic units for matching; (ii) to avoid costly comparisons without faulty dismissals, phrases are partitioned using classification methods, which limits the number of pairs to match; (iii) to match phrases, we represent them in the form of distributed vectors, which enables calculating similarity for formally different but semantically related phrases. Fig. 2 shows an overview of our patients-to-clinical-trial matching framework. Given a Clinical Trial CT and a set of Patients P, our task is to calculate a matching score M(Pi, CT).

A. Paragraph and sentence decomposition

In order to measure the similarity between two sentences, we have to deal with simple sentences representing linguistically meaningful units. This requires segmenting both paragraph-level and sentence-level structures into phrase-level structures. According to [15], segmentation of paragraphs and sentences is the process of parsing longer processing units, consisting of one or more words, for further processing stages such as part-of-speech parsers, morphological analyzers, etc.

In our model, we handle each phrase as a primitive semantic unit and find matching phrases between patient and clinical
trials by calculating the similarity of each phrase in the discharge summary to each phrase in the Eligibility Criteria (EC).

We used the paragraph and sentence segmentation of MetaMap [16]. MetaMap is provided by the National Library of Medicine (NLM) to map text, through its Medical Language Processor (MLP), to UMLS Metathesaurus concepts [17]. MetaMap breaks text into paragraphs, then sentences, and then phrases. Table I presents a simple example of segmenting sentences into phrases: the first entry refers to the eligibility criteria of trial NCT03484780, and the second illustrates an example from a patient discharge summary.

TABLE I
EXAMPLE OF SENTENCE SEGMENTATION INTO PHRASES

Paragraph | Phrases
Eligibility Criteria NCT03484780: "Previous open laparotomy or contraindications to laparoscopy, as determined by implanting physician." | 1) Previous open laparotomy; 2) contraindications to laparoscopy; 3) determined by implanting physician
Discharge Summary: "History of paroxysmal atrial fibrillation with anticoagulation in the past. History of coronary artery disease status post myocardial infarction." | 1) History of paroxysmal atrial fibrillation; 2) with anticoagulation in the past; 3) History of coronary artery disease; 4) status post myocardial infarction

B. Phrases classification

A discharge summary report contains information about different topics. Therefore, the large number of heterogeneous phrases extracted from patient reports may affect the efficiency and effectiveness of pairwise phrase matching [18].

To minimize the number of required comparisons, we applied a filtering methodology which discards all pairs of phrases that do not belong to the same class, thereby limiting the number of pairs to match.

Data classification techniques support this filtering by separating the phrases extracted from patient data and clinical trials into different medical categories. This classification filters out non-matching pairs prior to verification, which increases the efficiency of phrase similarity matching with high precision and without sacrificing recall.

In our study, a total of 1500 eligibility criteria were extracted from a clinical trials database¹ and manually labelled by a certified nurse and a data science master's student according to four classes (diagnosis, drug, procedure, observation).

¹ https://clinicaltrials.gov/

In this work, we empirically explored and compared four methods widely used in classification as our baselines, SVM, CNN, LSTM, and C-LSTM [19], in order to identify the one with the best performance. For the SVM and CNN models, we initialized word embeddings with PubMed-and-PMC-w2v [20], representing a sentence by the average of the word embeddings over all its words.

Our experiments indicate that the CNN + w2v model has the best prediction performance among the models selected in our exploration, with a Precision of 0.87, a Recall of 0.88, and an F1-score of 0.875. We therefore adopted CNN + PubMed-and-PMC-w2v to perform this classification task and were able to categorize the phrases into the four aforementioned categories.

C. Phrase vector representations

The purpose of this work is to allow the matching of patient data and clinical trials by comparing unstructured data from both sources. Our claim is that by measuring the similarity between the primitive semantic medical units (medical phrases) of a patient's Discharge Summary and those of the Eligibility Criteria, we can generate a score value supporting the matching task.

There are plenty of measures of semantic similarity between sentences used in NLP. Unsupervised and supervised methods have been used to calculate the semantic similarity between two sentences in the biomedical domain [21]. Recently, a number of novel approaches have been proposed to address this problem by producing sentence vectors [22]. For example, neural sentence-embedding methods [23] have been shown to outperform traditional approaches such as TF-IDF and word-overlap-based measures.

1) Universal sentence embeddings: The concept of universal sentence embeddings has grown in popularity as it leverages models trained on large text corpora. These pre-trained models can be used in a wide range of downstream tasks, for example as versatile sentence-embedding models that convert sentences into vector representations. Notable works include ELMo [24], GPT [25], and BERT [26].

2) BioBERT: BERT (Bidirectional Encoder Representations from Transformers) is a neural network language model trained on plain text for masked word prediction and next-sentence prediction tasks. BERT applies a multi-layer bidirectional transformer encoder with self-attention. According to [27], BERT achieved state-of-the-art performance on many Natural Language Processing tasks and was significantly better than other models. Compared against more recent models, XLNet [28] outperforms BERT and achieves better prediction metrics on the GLUE benchmark [29], but it is not yet widely used in the medical field. Applying the same architecture as BERT, Lee et al. [30] proposed the BioBERT language model, trained on biomedical corpora including PubMed and PMC. The BioBERT model showed promising results in the biomedical domain.

3) Phrase embedding: To generate context-rich phrase embeddings, we chose BioBERT as the language model, in conjunction with the bert-as-service library [31]. Bert-as-service is a feature-extraction service based on BERT which offers two strategies to derive a fixed-size vector. The default strategy performs average pooling over all token vectors of the second-to-last hidden layer, while the second strategy uses the output of the special [CLS] token and is recommended only after fine-tuning BERT on a downstream task.

D. Phrases Similarity Measures

The similarity between two vectors can be evaluated using various similarity measures such as cosine similarity, Euclidean distance, and Manhattan distance.
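As a minimal, dependency-free illustration of the cosine measure named above (the toy vectors are illustrative assumptions, not the paper's BioBERT embeddings):

```python
from math import sqrt

def cosine(u, v):
    """cos(x, y) = x.y / (||x|| ||y||), as in Equation (1) of the text."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Ranking rule of Equation (2): A is more similar to B than to C
# whenever cos(A, B) > cos(A, C).
A = [1.0, 2.0, 0.0]
B = [1.0, 1.9, 0.1]
C = [0.0, 1.0, 3.0]
more_similar_to_B = cosine(A, B) > cosine(A, C)
```

Because cosine ignores vector length, two phrase vectors pointing in the same direction score 1.0 regardless of magnitude, which is why the framework ranks by cosine rather than by a distance in the raw embedding space.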
Since these similarity metrics define a linear space in which all dimensions are weighted equally, we perform the similarity matching of different phrases by ranking them according to their cosine similarity. The similarity ranking is obtained with Equations (1) and (2):

cos(x, y) = (x · y) / (||x|| ||y||)   (1)

if cos(A, B) > cos(A, C), then A is more similar to B than to C.   (2)

Fig. 2. Framework overview: Discharge Summary and Eligibility Criteria text passes through phrase segmentation, phrase classification (diagnosis, drug, procedure), fine-tuned BioBERT phrase embedding, maximum pairwise cosine similarity, and finally the ranking and scoring of patients against the clinical trial.

Whereas pre-trained BioBERT knowledge often yields good performance on certain tasks, as we shall see later on, this prior knowledge is not sufficient to compute the similarity of sentences based on their embeddings. Indeed, we first tried to compute the cosine similarity of expert-annotated sentence pairs using embeddings extracted from pre-trained BioBERT, without any fine-tuning. The result of the comparison was unsatisfactory and unacceptable (Table II); the most similar sentence could even be the exact opposite. For example, the most similar sentence to "History of CVA" was "patient has normal brain MRI", with a similarity value of 0.91, although it was annotated by experts as a "contradiction", while the "entailment" sentence "patient has history of stroke" appeared in second place with a similarity value of 0.89. These experiments reinforced our belief that it is necessary to fine-tune BioBERT on our downstream task.

1) Supervised Fine-tuning: Transfer learning is the process of extending a pre-trained model by leveraging data from an additional domain for better model generalization [32]. The most common transfer learning technique in NLP is fine-tuning, which involves copying the weights from a pre-trained network and tuning them using labeled data from the downstream tasks. BERT is a fine-tuning-based representation model that achieves state-of-the-art performance on a large suite of sentence-level tasks with pre-trained representations, thus reducing the need for many heavily engineered task-specific architectures.

In the context of natural language understanding (NLU) technology, comparing the relationship between two sentences underlies several downstream tasks such as Natural Language Inference (NLI) and Semantic Textual Similarity (STS) [29]. Moreover, the authors in [33] have shown that fine-tuning BERT on NLI and STS datasets creates sentence embeddings which achieve an improvement of 11.7 points compared to InferSent [34] and 5.5 points compared to the Universal Sentence Encoder [22]. In this context, we first fine-tuned BioBERT on the STS-B dataset, which generated our BioBERT-based model, and then further fine-tuned it on the MedNLI dataset, using the fine-tuning classifier from the BERT systems [35].

• MedNLI [36]: a large, publicly available, expert-annotated dataset drawn from the medical history sections of MIMIC-III. MedNLI includes a set of clinical sentence pairs (14,049 pairs), each annotated with one of three classes: entailment, contradiction, or neutral.
• STS-B [37]: a collection of sentence pairs selected from news headlines. The dataset consists of paired sentences (8,628 pairs) labelled by humans with a similarity score from 1 to 5 denoting how similar the two sentences are in terms of semantic meaning.

2) Evaluation of fine-tuned BioBERT: We evaluated the new BioBERT model by computing the cosine similarity between the phrase embeddings. We observed that the model was not just able to rank phrases in terms of similarity, but also gave more appropriate cosine values. A representative sample of the results is depicted in Table II.

E. Matching Patients to Clinical Trials

After fine-tuning the BioBERT model for optimized cosine similarity and creating both the Discharge Summary and the Clinical Trial phrase embeddings, we proceeded to find clinical trial participants in an EMR dataset.
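As described in Section III-C, each phrase vector here comes from bert-as-service's default strategy of average-pooling the token vectors of the model's second-to-last hidden layer. A dependency-free sketch of just that pooling step, using made-up token vectors (the numbers and the 4-dimensional size are illustrative assumptions, not real BioBERT outputs):

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one fixed-size
    phrase vector (the pooling idea, sketched without the model)."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

# Made-up "token vectors" for a three-token phrase; a real run would take
# the second-to-last hidden layer of fine-tuned BioBERT for each token.
tokens = [[0.2, 0.4, 0.0, 1.0],
          [0.4, 0.0, 0.6, 1.0],
          [0.0, 0.2, 0.0, 1.0]]
phrase_vec = mean_pool(tokens)  # one 4-d vector for the whole phrase
```

Pooling yields one fixed-size vector per phrase regardless of phrase length, which is what makes the pairwise cosine comparisons of the next step possible.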
TABLE II
NLI AND COSINE SIMILARITY BEFORE AND AFTER FINE-TUNING OF BIOBERT

Phrase 1 (P1) | Phrase 2 (P2) | Experts NLI(P1, P2) | Pre-trained BioBERT Cos(P1, P2) / Rank | Fine-tuned BioBERT Cos(P1, P2) / Rank
History of CVA | patient has history of stroke | Entailment | 0.89 / 1.53 | 0.87 / 3.00
History of CVA | patient has normal brain mri | Contradiction | 0.91 / 3.00 | 0.75 / 0.00
History of CVA | patient is hemiplegic | Neutral | 0.86 / 0.00 | 0.77 / 0.38
Per report ECG with initial qtc of 410 now 475, QRS 82 initially, now 86 rate= 95. | Patient has abnormal EKG findings. | Entailment | 0.89 / 2.05 | 0.82 / 3.00
(same P1) | Patient has normal EKG. | Contradiction | 0.90 / 3.00 | 0.80 / 2.30
(same P1) | Patient has angina. | Neutral | 0.88 / 0.00 | 0.73 / 0.00
History of hypercholesterolemia and peptic ulcer disease s/p gastric bypass some years ago was involved in a low-speed MVC. | the patient was in a MVC. | Entailment | 0.89 / 3.00 | 0.82 / 3.00
(same P1) | the patient has no medical history. | Contradiction | 0.88 / 2.48 | 0.53 / 0.00
(same P1) | the patient has no significant injuries. | Neutral | 0.86 / 0.00 | 0.67 / 1.51

Formally, we denote:
• DSi = {ph_{i,1}, ph_{i,2}, ..., ph_{i,r}}: the phrases extracted from the Discharge Summary of patient Pi.
• IEC = {iec_1, iec_2, ..., iec_p}: the phrases extracted from the Inclusion Eligibility Criteria.
• EEC = {eec_1, eec_2, ..., eec_q}: the phrases extracted from the Exclusion Eligibility Criteria.
• EC = {ec_1, ec_2, ..., ec_l} = IEC ∪ EEC, with l = p + q: all phrases extracted from the Eligibility Criteria.
• S ∈ [0, 1]^{n×l}: the cosine similarity matrix, where n and l are the numbers of patients and of EC elements, respectively.

1) Matching Patient to Eligibility Criteria: Once the phrase embeddings are computed for the patients and for the clinical trial eligibility criteria, we calculate the similarity between phrases of the same class (Diagnosis, Drug, Procedure, ...) as defined in sub-section III-B. An element s_{i,j} of S represents the similarity between patient Pi and a single eligibility criterion ec_j. The similarity function calculates the cosine between each phrase ph_{i,r} extracted from DSi and ec_j; only the highest cosine similarity value is retained for s_{i,j}, and all other values are discarded:

s_{i,j} = max_{ph_{i,r} ∈ DSi} cos(ph_{i,r}, ec_j),  with i ∈ [1, n] and j ∈ [1, l].   (3)

Once the similarity values are obtained, S is the n×l matrix whose entry (i, j) equals max_{ph_{i,r}} cos(ph_{i,r}, ec_j), from entry (1, 1) = max_{ph_{1,r}} cos(ph_{1,r}, ec_1) to entry (n, l) = max_{ph_{n,r}} cos(ph_{n,r}, ec_l).

2) Ranking and Scoring Patients: The semantic cosine similarity calculated in the previous paragraph enables a proportional similarity instead of exact textual matching. However, when we compare the similarity values obtained for different features (eligibility criteria) in the generated matrix S, we notice that a higher similarity value does not necessarily mean a greater similarity with the patient. For example, if s_{x,1} and s_{y,2} represent the highest values of the features ec_1 and ec_2, respectively, and s_{x,1} > s_{y,2}, this does not mean that Px has a phrase more similar to ec_1 than Py has to ec_2 (as noted in Equation (2)); it only means that Px and Py are ranked, respectively, at the top of the similarity lists for ec_1 and ec_2. The same logic applies to the lowest value, which represents the last rank of similarity.

This variation of the similarity values between features requires a range normalization step that enables ranking by similarity instead of raw cosine similarity, and thereby supports the computation of a matching score between patients and the clinical trial. To this end, we generated a new matrix R by applying the following feature scaling normalization:

r_{i,j} = n × (s_{i,j} − min_i(s_{i,j})) / (max_i(s_{i,j}) − min_i(s_{i,j}))   if ec_j ∈ IEC
r_{i,j} = (−n) × (s_{i,j} − min_i(s_{i,j})) / (max_i(s_{i,j}) − min_i(s_{i,j}))   if ec_j ∈ EEC   (4)

where min_i and max_i are taken over all patients for the criterion ec_j. Finally, the matching score M of patient Pi with a clinical trial is determined by:

M(Pi, CT) = Σ_{j=1}^{l} r_{i,j}.   (5)

IV. EVALUATION

To validate our framework, we used two datasets: MIMIC-III (Medical Information Mart for Intensive Care) [38], comprising information relating to patients admitted to critical care units, and ClinicalTrials.gov², a Web-based resource providing access to information on supported clinical studies.

² https://clinicaltrials.gov/
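As a concrete illustration, the scoring steps of Section III-E, namely the max-cosine matrix of Equation (3), the signed min-max scaling of Equation (4), and the summed score of Equation (5), can be sketched with plain Python lists standing in for BioBERT phrase embeddings. The toy vectors and the zero-range guard are illustrative assumptions, not part of the paper's implementation:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (Eq. 1)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def similarity_matrix(patients, criteria):
    """Eq. 3: s[i][j] = max cosine between any phrase of patient i
    and criterion phrase j."""
    return [[max(cosine(ph, ec) for ph in phrases) for ec in criteria]
            for phrases in patients]

def normalize(S, n, inclusion_flags):
    """Eq. 4: per-criterion min-max scaling to [0, n]; exclusion
    criteria contribute negatively."""
    l = len(S[0])
    R = [[0.0] * l for _ in S]
    for j in range(l):
        col = [row[j] for row in S]
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # guard against a constant column (assumption)
        for i, row in enumerate(S):
            scaled = n * (row[j] - lo) / span
            R[i][j] = scaled if inclusion_flags[j] else -scaled
    return R

def match_scores(patients, criteria, inclusion_flags):
    """Eq. 5: a patient's score is the row sum of the normalized matrix."""
    S = similarity_matrix(patients, criteria)
    R = normalize(S, len(patients), inclusion_flags)
    return [sum(row) for row in R]
```

With two toy patients whose single phrase vectors align with an inclusion criterion and an exclusion criterion respectively, `match_scores` ranks the first patient highest and penalizes the second, mirroring the positive and negative columns of Table III.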
A. Text processing

The MIMIC-III Clinical Dataset is a critical care database that contains 2,083,108 medical reports from 46,520 patients. We experimented with a randomly selected dataset of 100 Discharge Summaries from patients' last visits, excluding patients whose age is under 18. The segmentation stage produces an average of 400 phrases per report.

We selected a clinical trial that investigates the role of an aldosterone antagonist in patients with heart failure with preserved ejection fraction (NCT04078425). Fig. 3 shows the five eligibility criteria of this clinical trial.

Fig. 3. The eligibility criteria specified in the NCT04078425 clinical trial.

B. Evaluation of the obtained results

Table III presents the results for a sample of ten patients. In order to evaluate the clinical correctness of the patients' matching to the clinical trial (NCT04078425), a validation task was performed manually by a nurse and a computer science student. The noteworthy fact is that the evaluation of the matching does not reveal false positives in the score results. Indeed, the similarity scores reflect the order of matching between the patients and the clinical trial. The score distribution ranged from −15 to 8, and the eligible patients to be retained for further screening by experts were those with a score greater than 5.

TABLE III
RANKS AND SCORES OF MATCHING 10 PATIENTS WITH 6 ELIGIBILITY CRITERIA (NCT04078425)

      | iec1  | iec2  | eec1   | eec2   | eec3   | eec4   | Score
P-1   | 9.46  | 9.84  | -8.47  | -2.74  | -1.02  | -1.02  | 6.03
P-2   | 5.02  | 7.44  | -3.43  | -3.12  | -8.42  | -8.42  | -10.93
P-3   | 9.08  | 8.65  | -10.00 | -2.38  | -10.00 | -10.00 | -14.65
P-4   | 0.00  | 4.09  | -2.96  | -5.76  | -5.24  | -5.24  | -15.12
P-5   | 3.43  | 4.02  | -6.26  | -2.69  | -1.09  | -1.09  | -3.69
P-6   | 5.19  | 0.00  | 0.00   | -1.42  | 0.00   | 0.00   | 3.77
P-7   | 5.65  | 2.95  | -3.86  | -2.72  | -0.15  | -0.15  | 1.72
P-8   | 7.26  | 10.00 | -7.52  | -5.76  | -6.98  | -6.98  | -9.99
P-9   | 6.43  | 9.14  | -4.44  | -10.00 | -2.70  | -2.70  | -4.27
P-10  | 10.00 | 7.44  | -6.27  | 0.00   | -10.00 | -10.00 | -8.83

We should note that the scores would be more realistic if the segmentation process were more accurate. For instance, the sentence "you were thought to have a blood clot in your right leg" was segmented by MetaMap into "a blood clot in your right leg", which would result in a false outcome.

V. CONCLUSION

EMRs contain a large portion of unstructured data that needs to be matched against eligibility criteria for trial-patient enrollment. Indeed, the gradual improvement of artificial intelligence technology could reduce the number of physician-hours spent in screening patient eligibility. To tackle this problem, we proposed a framework designed to automatically recommend the most suitable patients for a clinical trial. The framework adopts a pre-trained language model (BioBERT) and uses the STS-B and MedNLI datasets to improve the accuracy of the model via transfer learning. This work verified that fine-tuned BioBERT shows better performance in calculating the similarity between two medical sentences using embedding-based metrics. In future work, we will also explore EMRs' structured tables in order to significantly improve the performance and accuracy of our trial-patient matching framework.

ACKNOWLEDGMENT

The authors would like to thank Marvin Moughabghab for his efforts and contributions to this work.

REFERENCES

[1] H. Dhayne, R. Haque, R. Kilany, and Y. Taher, "In search of big medical data integration solutions - a comprehensive survey," IEEE Access, vol. 7, pp. 91265–91290, 2019.
[2] G. De Moor, M. Sundgren, D. Kalra, A. Schmidt, M. Dugas, B. Claerhout, T. Karakoyun, C. Ohmann, P.-Y. Lastic, N. Ammour et al., "Using electronic health records for clinical research: the case of the EHR4CR project," Journal of Biomedical Informatics, vol. 53, pp. 162–173, 2015.
[3] M. Dugas, M. Lange, C. Müller-Tidow, P. Kirchhof, and H.-U. Prokosch, "Routine data from hospital information systems can support patient recruitment for clinical studies," Clinical Trials, vol. 7, no. 2, pp. 183–189, 2010.
[4] S. T. Rosenbloom, J. C. Denny, H. Xu, N. Lorenzi, W. W. Stead, and K. B. Johnson, "Data from clinical notes: a perspective on the tension between structure and flexible documentation," Journal of the American Medical Informatics Association, vol. 18, no. 2, pp. 181–186, 2011.
[5] H. Dhayne, R. Kilany, R. Haque, and Y. Taher, "SeDIE: A semantic-driven engine for integration of healthcare data," in 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2018, pp. 617–622.
[6] P. Raghavan, J. L. Chen, E. Fosler-Lussier, and A. M. Lai, "How essential are unstructured clinical narratives and information fusion to clinical trial recruitment?" AMIA Summits on Translational Science Proceedings, vol. 2014, p. 218, 2014.
[7] S. W. Tu, M. Peleg, S. Carini, M. Bobak, J. Ross, D. Rubin, and I. Sim, "A practical method for transforming free-text eligibility criteria into computable criteria," Journal of Biomedical Informatics, vol. 44, no. 2, pp. 239–250, 2011.
[8] S. Kripalani, F. LeFevre, C. O. Phillips, M. V. Williams, P. Basaviah, and D. W. Baker, "Deficits in communication and information transfer between hospital-based and primary care physicians: implications for patient safety and continuity of care," JAMA, vol. 297, no. 8, pp. 831–841, 2007.
[9] K. Milian, R. Hoekstra, A. Bucur, A. ten Teije, F. van Harmelen, and J. Paulissen, "Enhancing reuse of structured eligibility criteria and supporting their relaxation," Journal of Biomedical Informatics, vol. 56, pp. 205–219, 2015.
[10] C. Patel, J. Cimino, J. Dolby, A. Fokoue, A. Kalyanpur, A. Kershenbaum, L. Ma, E. Schonberg, and K. Srinivas, "Matching patient records to clinical trials using ontologies," in The Semantic Web. Springer, 2007, pp. 816–829.
[11] T. Kang, S. Zhang, Y. Tang, G. W. Hruby, A. Rusanov, N. Elhadad, and C. Weng, "EliIE: An open-source information extraction system for clinical trial eligibility criteria," Journal of the American Medical Informatics Association, vol. 24, no. 6, pp. 1062–1071, 2017.
[12] C. Yuan, P. B. Ryan, C. Ta, Y. Guo, Z. Li, J. Hardin, R. Makadia, P. Jin, N. Shang, T. Kang et al., “Criteria2query: a natural language interface to clinical databases for cohort definition,” Journal of the American Medical Informatics Association, vol. 26, no. 4, pp. 294–305, 2019.
[13] C. Shivade, C. Hebert, M. Lopetegui, M.-C. De Marneffe, E. Fosler-Lussier, and A. M. Lai, “Textual inference for eligibility criteria resolution in clinical trials,” Journal of biomedical informatics, vol. 58, pp. S211–S218, 2015.
[14] Y. Ni, S. Kennebeck, J. W. Dexheimer, C. M. McAneney, H. Tang, T. Lingren, Q. Li, H. Zhai, and I. Solti, “Automated clinical trial eligibility prescreening: increasing the efficiency of patient identification for clinical trials in the emergency department,” Journal of the American Medical Informatics Association, vol. 22, no. 1, pp. 166–178, 2014.
[15] D. D. Palmer, “Tokenisation and sentence segmentation,” Handbook of natural language processing, pp. 11–35, 2000.
[16] A. R. Aronson, “Effective mapping of biomedical text to the umls metathesaurus: the metamap program,” in Proceedings of the AMIA Symposium. American Medical Informatics Association, 2001, p. 17.
[17] A. R. Aronson and F.-M. Lang, “An overview of metamap: historical perspective and recent advances,” Journal of the American Medical Informatics Association, vol. 17, no. 3, pp. 229–236, 2010.
[18] G. Papadakis, E. Ioannou, T. Palpanas, C. Niederee, and W. Nejdl, “A blocking framework for entity resolution in highly heterogeneous information spaces,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 12, pp. 2665–2682, 2012.
[19] C. Zhou, C. Sun, Z. Liu, and F. Lau, “A c-lstm neural network for text classification,” arXiv preprint arXiv:1511.08630, 2015.
[20] S. Moen and T. S. S. Ananiadou, “Distributional semantics resources for biomedical text processing.”
[21] G. Soğancıoğlu, H. Öztürk, and A. Özgür, “Biosses: a semantic sentence similarity estimation system for the biomedical domain,” Bioinformatics, vol. 33, no. 14, pp. i49–i58, 2017.
[22] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar et al., “Universal sentence encoder,” arXiv preprint arXiv:1803.11175, 2018.
[23] Q. Chen, Y. Peng, and Z. Lu, “Biosentvec: creating sentence embeddings for biomedical texts,” arXiv preprint arXiv:1810.09302, 2018.
[24] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” arXiv preprint arXiv:1802.05365, 2018.
[25] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language understanding paper.pdf, 2018.
[26] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[27] A. Talman and S. Chatzikyriakidis, “Testing the generalization power of neural network models across nli benchmarks,” in Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2019, pp. 85–94.
[28] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” arXiv preprint arXiv:1906.08237, 2019.
[29] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Glue: A multi-task benchmark and analysis platform for natural language understanding,” arXiv preprint arXiv:1804.07461, 2018.
[30] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, “Biobert: pre-trained biomedical language representation model for biomedical text mining,” arXiv preprint arXiv:1901.08746, 2019.
[31] H. Xiao, “bert-as-service,” https://github.com/hanxiao/bert-as-service, 2018.
[32] S. Ruder, M. E. Peters, S. Swayamdipta, and T. Wolf, “Transfer learning in natural language processing,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, 2019, pp. 15–18.
[33] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” arXiv preprint arXiv:1908.10084, 2019.
[34] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, “Supervised learning of universal sentence representations from natural language inference data,” arXiv preprint arXiv:1705.02364, 2017.
[35] “google-research/bert: Tensorflow code and pre-trained models for bert,” https://github.com/google-research/bert, (Accessed on 09/17/2019).
[36] A. Romanov and C. Shivade, “Lessons from natural language inference in the clinical domain,” arXiv preprint arXiv:1808.06752, 2018.
[37] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia, “Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation,” arXiv preprint arXiv:1708.00055, 2017.
[38] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark, “Mimic-iii, a freely accessible critical care database,” Scientific data, vol. 3, p. 160035, 2016.