Using Embedding-based Metrics to Expedite the Patient Recruitment Process for Clinical Trials

1st Houssein Dhayne, Faculty of Engineering, ESIB, Saint Joseph University, Beirut, Lebanon. houssein.dhayne@net.usj.edu.lb
2nd Rima Kilany, Faculty of Engineering, ESIB, Saint Joseph University, Beirut, Lebanon. rima.kilany@usj.edu.lb

Abstract—Despite the unprecedented volumes of Electronic Medical Records (EMRs) generated daily across healthcare facilities, the ability to leverage these data for patient participation in clinical trials remains largely unfulfilled. The reason is that matching patient information to the eligibility criteria of clinical trials is a manual, effort-consuming process. Automating this process is therefore an essential step toward increasing the number of patients participating in clinical research. To address this issue, we propose a novel framework for automated patient-to-clinical-trial matching. The matching process is based on measuring a similarity score between phrases extracted from patient medical records and the eligibility criteria of a trial. Our solution combines classical NLP techniques with modern deep learning-based NLP models. In this context, we follow pre-training and transfer learning approaches to help the model learn task-specific reasoning skills, and we perform supervised fine-tuning on the Medical Natural Language Inference (MedNLI) and Semantic Textual Similarity (STS-B) datasets. The matching is performed at the level of semantic phrases by converting patient information and trial criteria into vector representations; a scoring function combining cosine similarity and scaling normalization then identifies potential patient-trial matches. Experimental results show that our framework is highly effective at ranking patients by their similarity scores.

Index Terms—NLP, NLI, EMR, Automated clinical trial eligibility screening, BioBERT, Sentence similarity

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
I. INTRODUCTION

The widespread adoption and use of electronic medical records (EMRs), together with the development of advanced artificial intelligence models, offer remarkable opportunities for improving the clinical research sector [1]. Furthermore, EMRs offer a wide range of potential uses in clinical trials, such as facilitating clinical trial feasibility assessment and patient recruitment, as well as obtaining key patient health information and medical history prior to the screening visit. The latter is a critical step in reducing the costs and duration of clinical trials [2]. Additionally, linking EMRs with clinical trials has been shown to increase patient recruitment rates [3]. However, many barriers must be overcome before EMRs can be used for clinical trials.

Even though EMRs were designed to record information in a structured format, such as procedure information, diagnosis codes, drug prescriptions, and lab results, free text remains the most flexible way for physicians to express case nuances and clinical reasoning [4]. These free texts usually contain important facts about patients, but they are rarely available for formal queries [5].

On the other hand, the eligibility criteria of a clinical trial describe the characteristics of patients who are qualified to participate in the trial. Each criterion is usually expressed as descriptive text and specified in the form of inclusion and exclusion criteria. Therefore, free-text criteria cannot always be transformed into structured data representations.

Authors in [6] confirmed that using only structured data from the EMR is insufficient for resolving eligibility criteria in patient recruitment for clinical trials, and that unstructured data is essential to resolve 59% to 77% of trial criteria. However, matching clinical notes with eligibility criteria is still performed manually, which makes it an expensive process in terms of time and effort. This slows down clinical trials and may delay new drugs from benefiting patients; as a consequence, it might entail the loss of human lives that would otherwise have benefited from new medication. For these reasons, automated matching of clinical notes with eligibility criteria in the eligibility screening workflow would help overcome the bottlenecks of pre-screening practices in a trial setting.

To tackle this challenge efficiently, we need to execute the matching process at a semantic sentence level, rather than just checking for the presence or absence of a lexical criterion. Investigating the potential of modern deep learning-based NLP (Natural Language Processing) models led us to propose a framework that automates the evaluation of patients' eligibility as candidates for a relevant clinical trial. As a first step, the framework splits patient clinical reports and clinical trial sentences into comparatively basic phrase units. Secondly, it classifies the phrases into clinical categories (diagnosis, drug, procedure, observation). Thirdly, it converts candidate phrases into vector representations using an appropriate deep learning-based NLP model. Finally, it calculates a semantic matching score between patients and a clinical trial using a combination of cosine similarity and a scaling normalization method.

This paper is organized as follows: In Section II, we expose the problem definition and review related work. In Section III, we describe our framework and illustrate the different challenges. The evaluation of the results is discussed in Section IV. Finally, we conclude in Section V.

II. BACKGROUND

A.
Problem definition

According to our approach, the problem of patient-trial matching can be described as follows. Finding clinical trial participants is the task of matching a patient P_i (P_i ∈ EMR), represented by a discharge summary DS_i, to a clinical trial CT, represented by its eligibility criteria EC. Formally, the solution to this task is to find the top-K highest values of a function M that computes the matching score, denoted by:

    M(P_i, CT) = v

where v represents the score of matching patient P_i to CT. The list of the top-K highest scores reduces the overall number of patients that need to be screened by clinicians in order to identify eligible patients.
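As a minimal illustration of this formulation, the Python sketch below ranks patients with an arbitrary scoring function and keeps the K highest; `matching_score` is a placeholder for the function M that Section III constructs.

```python
# Minimal sketch of the top-K selection; `matching_score` is a
# placeholder for the function M developed in Section III.
from typing import Callable, List, Tuple

def top_k_candidates(patients: List[str], trial: str,
                     matching_score: Callable[[str, str], float],
                     k: int = 10) -> List[Tuple[str, float]]:
    """Rank patients by M(P_i, CT) and keep the K highest scores."""
    scored = [(p, matching_score(p, trial)) for p in patients]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```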
B. Data representation

1) Clinical trial: A clinical trial is a type of research that provides a long-standing foundation for the practice of medicine and the evaluation of new medical treatments. Each trial has eligibility criteria describing the characteristics required of participants: a patient must meet all inclusion criteria and none of the exclusion criteria. These criteria differ from study to study. Authors in [7] analysed 1000 eligibility criteria and showed that 23% of the criteria are simple, or can be reduced to simple criteria, while 77% remain complex to evaluate. Therefore, a formally computable representation of eligibility criteria requires natural language processing techniques as part of automated screening for patient eligibility.

2) Patient medical records: An EMR typically collects various types of patient information, including discharge summaries, prior diagnoses, radiology reports, medication history, and so on. Hospital discharge summaries are a physician-authored synopsis of a patient's hospital stay, which serves as the main document communicating a patient's care plan to the post-hospital care team [8]. Discharge summaries are organized in several sections, usually including past medical history and history of present illness, as shown in Fig. 1.

Fig. 1. An example of discharge summary contents and format.

C. Related work

In the recent past, several projects have developed tools and technologies for automated trial-patient matching. Milian et al. [9] used a template-based formalism to extract and represent the semantics of trial criteria in order to improve their comparability. Patel et al. [10] formulated the matching process as a semantic retrieval problem by expressing clinical trial criteria as semantic queries, which a reasoner can then use with a formal medical ontology (SNOMED CT) to retrieve eligible patients. Other works, such as EliIE [11] and Criteria2Query [12], have focused on identifying standardized medical entities in eligibility criteria using machine learning approaches, the extracted entities then being used to query patient data. Shivade et al. [13] constructed an annotated dataset indicating whether a medical note contains text that meets a criterion or not; they then implemented two lexical and two semantic methods to determine a relevance score between each sentence and a criterion statement, and found that the semantic methods gave better results than the lexical ones. Ni et al. [14] evaluated a system using a combination of NLP, information retrieval, and machine learning methods to identify a cohort of patients for clinical trial eligibility pre-screening. Their system relies on both structured data and clinical notes from EMRs.
III. FRAMEWORK OVERVIEW

In this section, we describe the framework we propose for automating the matching process between patients and a clinical trial. The framework takes into account the following challenges: (i) in order to treat complex sentences in patient data as well as in clinical trials, we break paragraphs down into sentences, and complex sentences are then parsed into phrases, which are the basic units for matching; (ii) to avoid costly comparisons without faulty dismissals, phrases are partitioned using classification methods, which limits the number of pairs to match; (iii) to match phrases, we represent them as distributed vectors, which enables calculating similarity for formally different but semantically related phrases. Fig. 2 shows an overview of our patients-to-clinical-trial matching framework. Given a clinical trial CT and a set of patients P, our task is to calculate a matching score M(P_i, CT).

Fig. 2. Framework overview: phrase segmentation, phrase classification, phrase embedding with fine-tuned BioBERT, maximum cosine similarity, and ranking and scoring (illustrated in the figure with an example maximum-similarity matrix for five patients and four criteria, and the corresponding normalized scores).

A. Paragraph and sentence decomposition

In order to measure the similarity between two sentences, we have to deal with simple sentences, each representing a linguistically meaningful unit. This requires segmenting both paragraph-level and sentence-level structures into phrase-level structures. According to [15], segmentation of paragraphs and sentences is the process of parsing longer processing units, consisting of one or more words, for further processing stages such as part-of-speech parsers, morphological analyzers, etc.

In our model, we handle each phrase as a primitive semantic unit and find matching phrases between patient and clinical trial by calculating the similarity of each phrase in the discharge summary to each phrase in the eligibility criteria (EC). We used the paragraph and sentence segmentation of MetaMap [16]. MetaMap is provided by the National Library of Medicine (NLM) to map biomedical text to UMLS Metathesaurus concepts [17]; it breaks text into paragraphs, then sentences, and then phrases. Table I presents a simple example of segmenting sentences into phrases: the first row refers to the eligibility criteria of trial NCT03484780, and the second illustrates an example from a patient discharge summary.

TABLE I
EXAMPLE OF SENTENCE SEGMENTATION INTO PHRASES

Paragraph | Phrases
Eligibility criteria (NCT03484780): "Previous open laparotomy or contraindications to laparoscopy, as determined by implanting physician." | 1. Previous open laparotomy; 2. contraindications to laparoscopy; 3. determined by implanting physician
Discharge summary: "History of paroxysmal atrial fibrillation with anticoagulation in the past. History of coronary artery disease status post myocardial infarction." | 1. History of paroxysmal atrial fibrillation; 2. with anticoagulation in the past; 3. History of coronary artery disease; 4. status post myocardial infarction
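For illustration only, the following sketch approximates this decomposition with regular expressions. It is a crude stand-in for MetaMap, not the tool the framework actually uses, and the splitting rules are our own assumptions tuned to the Table I example.

```python
# Rough, regex-based stand-in for MetaMap's segmentation, shown only to
# illustrate the paragraph -> sentence -> phrase decomposition of Table I.
import re

def segment(paragraph: str) -> list:
    """Split a paragraph into sentences, then into candidate phrases."""
    sentences = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    phrases = []
    for sentence in sentences:
        # Commas and coordinating conjunctions as crude phrase boundaries.
        parts = re.split(r',\s*|\s+(?:or|and)\s+', sentence)
        phrases.extend(p.strip(' .') for p in parts if p.strip(' .'))
    return phrases

print(segment("Previous open laparotomy or contraindications to "
              "laparoscopy, as determined by implanting physician."))
# ['Previous open laparotomy', 'contraindications to laparoscopy',
#  'as determined by implanting physician']
```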
B. Phrases classification

A discharge summary report contains information about many different topics. Therefore, the large number of heterogeneous phrases extracted from patient reports may affect the efficiency and effectiveness of pairwise phrase matching [18]. To minimize the number of required comparisons, we applied a filtering methodology that excludes all phrases that do not belong to a given class, which limits the number of pairs to match.

Data classification techniques support this filtering by separating the phrases extracted from patient data and clinical trials into different medical categories. This classification filters out non-matching pairs prior to verification, which increases the efficiency of phrase similarity matching with high precision and without sacrificing recall.

In our study, a total of 1500 eligibility criteria were extracted from a clinical trials database (https://clinicaltrials.gov/) and manually labelled by a certified nurse and a data science master's student according to four classes (diagnosis, drug, procedure, observation). We empirically explored and compared four methods widely used in classification as our baselines: SVM, CNN, LSTM, and C-LSTM [19]. For the SVM and CNN models, word embeddings were initialized with PubMed-and-PMC-w2v [20], using the average of the word embeddings over all words in the sentence.

Our experiments indicate that the CNN + w2v model has the best prediction performance among the models explored, with a precision of 0.87, a recall of 0.88, and an F1-score of 0.875. We therefore adopted CNN + PubMed-and-PMC-w2v to perform this classification task and were able to categorize the phrases into the four aforementioned categories.
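As a hedged sketch of such a classifier, the Keras model below follows the common Conv1D-over-embeddings pattern. The layer sizes are illustrative assumptions (the paper does not report the exact architecture), and in practice the Embedding layer would be initialized with the PubMed-and-PMC-w2v vectors rather than trained from scratch.

```python
# Illustrative Keras sketch of the CNN phrase classifier; hyperparameters
# are assumptions, and in the paper the Embedding weights would come from
# PubMed-and-PMC-w2v instead of random initialization.
from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAX_LEN, NUM_CLASSES = 20000, 200, 30, 4

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Conv1D(128, kernel_size=3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    # Four outputs: diagnosis, drug, procedure, observation.
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```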
C. Phrase vector representations

The purpose of this work is to allow the matching of patient data and clinical trials by comparing unstructured data from both. Our claim is that by measuring the similarity between the primitive semantic medical units (medical phrases) of a patient's discharge summary and of the eligibility criteria, we can generate a score supporting the matching task.

There are plenty of measures of semantic similarity between sentences in NLP. Both unsupervised and supervised methods have been used to calculate the semantic similarity between two sentences in the biomedical domain [21]. Recently, a number of novel approaches have addressed this problem by producing sentence vectors [22]; for example, neural sentence-embedding methods [23] have been shown to outperform traditional approaches such as TF-IDF and word-overlap-based measures.

1) Universal sentence embeddings: The concept of universal sentence embeddings has grown in popularity as it leverages models trained on large text corpora. These pre-trained models can be used in a wide range of downstream tasks, for instance by providing versatile sentence-embedding models that convert sentences into vector representations. Notable works include ELMo [24], GPT [25], and BERT [26].

2) BioBERT: BERT (Bidirectional Encoder Representations from Transformers) is a neural network language model trained on plain text for masked word prediction and next sentence prediction. BERT applies a multi-layer bidirectional transformer encoder with self-attention. According to [27], BERT achieved state-of-the-art performance on many natural language processing tasks and was significantly better than other models. Compared against more recent models, XLNet [28] outperforms BERT and achieves better prediction metrics on the GLUE benchmark [29], but it is not yet widely used in the medical field. Applying the same architecture as BERT, Lee et al. [30] proposed the BioBERT language model, trained on biomedical corpora including PubMed and PMC. The BioBERT model has shown promising results in the biomedical domain.

3) Phrase embedding: To generate context-rich phrase embeddings, we chose BioBERT as the language model, in conjunction with the bert-as-service library [31]. Bert-as-service is a feature extraction service based on BERT which uses one of two strategies to derive a fixed-size vector: the default strategy average-pools all token representations of the second-to-last hidden layer, while the second uses the output of the special CLS token and is recommended only after fine-tuning BERT on a downstream task.
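A minimal sketch of this extraction step, assuming a bert-as-service server has already been started over a BioBERT checkpoint (the model path below is a placeholder):

```python
# Phrase embedding via bert-as-service; assumes a server was started over
# a BioBERT checkpoint, e.g.:
#   bert-serving-start -model_dir /path/to/biobert -num_worker=1
# By default the service average-pools the second-to-last hidden layer.
from bert_serving.client import BertClient

bc = BertClient()  # connects to a local server by default
phrases = ["History of CVA",
           "patient has history of stroke",
           "patient has normal brain MRI"]
vecs = bc.encode(phrases)
print(vecs.shape)  # (3, 768) for a BERT-base-sized model
```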
D. Phrases Similarity Measures

The similarity between two vectors can be evaluated using various measures such as cosine similarity, Euclidean distance, and Manhattan distance. Since these metrics operate in a linear space in which all dimensions are weighted equally, we perform the similarity matching of different phrases by ranking them according to cosine similarity. The rank of similarity is obtained from equations (1) and (2):

    cos(x, y) = (x · y) / (||x|| ||y||)    (1)

    if cos(A, B) > cos(A, C), then A is more similar to B than to C.    (2)

Whereas pre-trained BioBERT knowledge often performs well on certain tasks, as we shall see, this prior knowledge is not sufficient to compute the similarity of sentences from their embeddings. Indeed, we first tried computing the cosine similarity of expert-annotated sentences using embeddings extracted from pre-trained BioBERT, without any fine-tuning. The results were unsatisfactory and unacceptable (Table II): the most similar sentence was sometimes the exact opposite. For example, the sentence most similar to "History of CVA" was "patient has normal brain MRI", with a similarity of 0.91, even though experts annotated the pair as a contradiction, while the entailed sentence "patient has history of stroke" came second with a similarity of 0.89. These experiments reinforced our belief that BioBERT must be fine-tuned on our downstream task.

1) Supervised Fine-tuning: Transfer learning is the process of extending a pre-trained model by leveraging data from an additional domain for better model generalization [32]. The most common transfer learning technique in NLP is fine-tuning, which copies the weights of a pre-trained network and tunes them using labeled data from downstream tasks. BERT is a fine-tuning-based representation model that achieves state-of-the-art performance on a large suite of sentence-level tasks with pre-trained representations, reducing the need for many heavily engineered task-specific architectures.

In the context of natural language understanding (NLU), comparing the relationship between two sentences is addressed by downstream tasks such as Natural Language Inference (NLI) and Semantic Textual Similarity (STS) [29]. Moreover, authors in [33] have shown that fine-tuning BERT on NLI and STS datasets yields sentence embeddings that improve by 11.7 points over InferSent [34] and by 5.5 points over the Universal Sentence Encoder [22]. In this context, we first fine-tuned BioBERT on the STS-B dataset, producing our BioBERT-based model, and then further fine-tuned it on the MedNLI dataset, using the fine-tuning classifier from the BERT codebase [35].

• MedNLI [36]: a large, publicly available, expert-annotated dataset drawn from the medical history sections of MIMIC-III. MedNLI includes 14,049 clinical sentence pairs, each annotated with one of three classes: entailment, contradiction, or neutral.
• STS-B [37]: a collection of sentence pairs selected from sources such as news headlines. The dataset consists of 8,628 paired sentences labelled by humans with a similarity score from 0 to 5 denoting how similar the two sentences are in semantic meaning.

2) Evaluation of fine-tuned BioBERT: We evaluated the new BioBERT model by computing the cosine similarity between phrase embeddings. We observed that the model was not only able to rank phrases correctly in terms of similarity, but also gave more appropriate cosine values. A representative sample of the results is shown in Table II.

TABLE II
NLI AND COSINE SIMILARITY BEFORE AND AFTER FINE-TUNING OF BIOBERT

Phrase 1 (P1) | Phrase 2 (P2) | Experts NLI(P1, P2) | Pre-trained BioBERT Cos / Rank | Fine-tuned BioBERT Cos / Rank
History of CVA | patient has history of stroke | Entailment | 0.89 / 1.53 | 0.87 / 3.00
 | patient has normal brain mri | Contradiction | 0.91 / 3.00 | 0.75 / 0.00
 | patient is hemiplegic | Neutral | 0.86 / 0.00 | 0.77 / 0.38
Per report ECG with initial qtc of 410 now 475, QRS 82 initially, now 86, rate = 95. | Patient has abnormal EKG findings. | Entailment | 0.89 / 2.05 | 0.82 / 3.00
 | Patient has normal EKG. | Contradiction | 0.90 / 3.00 | 0.80 / 2.30
 | Patient has angina. | Neutral | 0.88 / 0.00 | 0.73 / 0.00
History of hypercholesterolemia and peptic ulcer disease s/p gastric bypass; some years ago was involved in a low-speed MVC. | the patient was in a MVC. | Entailment | 0.89 / 3.00 | 0.82 / 3.00
 | the patient has no medical history. | Contradiction | 0.88 / 2.48 | 0.53 / 0.00
 | the patient has no significant injuries. | Neutral | 0.86 / 0.00 | 0.67 / 1.51
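A hedged sketch of the first fine-tuning stage (STS-B-style similarity regression) using the HuggingFace transformers library: the paper used the original BERT fine-tuning scripts [35], so the checkpoint name, gold score, and training step here are illustrative assumptions rather than the authors' exact setup.

```python
# Hedged sketch of fine-tuning BioBERT for STS-B-style similarity
# regression; the checkpoint name below is an assumed public BioBERT
# model, not necessarily the one the authors used.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "dmis-lab/biobert-v1.1"  # assumption: a public BioBERT checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=1)  # a single label makes the head a regressor (MSE)

batch = tok(["History of CVA"], ["patient has history of stroke"],
            padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([4.5])  # illustrative gold similarity on the STS scale
loss = model(**batch, labels=labels).loss
loss.backward()  # an optimizer step over the full dataset would follow
```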
E. Matching Patients to Clinical Trials

After fine-tuning the BioBERT model for optimized cosine similarity and creating phrase embeddings for both the discharge summaries and the clinical trial, we proceed to find clinical trial participants in an EMR dataset. Formally, we denote:

• DS_i = {ph_{i,1}, ph_{i,2}, ..., ph_{i,r}}: the phrases extracted from the discharge summary of patient P_i.
• IEC = {iec_1, iec_2, ..., iec_p}: the phrases extracted from the inclusion eligibility criteria.
• EEC = {eec_1, eec_2, ..., eec_q}: the phrases extracted from the exclusion eligibility criteria.
• EC = {ec_1, ec_2, ..., ec_l} = IEC ∪ EEC, with l = p + q: all phrases extracted from the eligibility criteria.
• S ∈ [0, 1]^{n×l}: the cosine similarity matrix, where n and l are the numbers of patients and EC elements, respectively.

1) Matching Patient to Eligibility Criteria: Once phrase embeddings are computed for the patients and the clinical trial eligibility criteria, we calculate the similarity between phrases of the same class (diagnosis, drug, procedure, ...) as defined in Subsection III-B. An element s_{i,j} of S represents the similarity between patient P_i and the single eligibility criterion ec_j. The similarity function computes the cosine between ec_j and each phrase ph_{i,r} extracted from DS_i; only the highest cosine value is retained for s_{i,j}, and all other values are discarded:

    s_{i,j} = max_{ph_{i,r} ∈ DS_i} cos(ph_{i,r}, ec_j),  i ∈ [1, n], j ∈ [1, l]    (3)

Once the similarity values are obtained, S takes the form:

    S = [ max_{ph_{1,r}} cos(ph_{1,r}, ec_1)   ...   max_{ph_{1,r}} cos(ph_{1,r}, ec_l) ]
        [               ...                   ...                  ...                 ]
        [ max_{ph_{n,r}} cos(ph_{n,r}, ec_1)   ...   max_{ph_{n,r}} cos(ph_{n,r}, ec_l) ]

2) Ranking and Scoring Patients: The semantic cosine similarity calculated above yields a proportional similarity instead of an exact textual match. However, when comparing the similarity values obtained for different features (eligibility criteria) in the matrix S, a higher value does not necessarily mean greater similarity to the patient. For example, if s_{x,1} and s_{y,2} are the highest values of features ec_1 and ec_2 respectively, and s_{x,1} > s_{y,2}, this does not mean that P_x has a phrase more similar to ec_1 than P_y has to ec_2 (as noted in equation (2)); it only means that P_x and P_y are ranked at the top of the similarity lists for ec_1 and ec_2, respectively. The same logic applies to the lowest value, which represents the last rank of similarity.

This variation in similarity values across features requires a range normalization step that turns cosine similarity into a rank-like similarity, which directly supports the computation of a matching score between patients and the clinical trial. To this end, we generate a new matrix R by applying the following feature-scaling normalization:

    r_{i,j} = n × (s_{i,j} − min_∀i(s_{i,j})) / (max_∀i(s_{i,j}) − min_∀i(s_{i,j}))      if ec_j ∈ IEC
    r_{i,j} = (−n) × (s_{i,j} − min_∀i(s_{i,j})) / (max_∀i(s_{i,j}) − min_∀i(s_{i,j}))   if ec_j ∈ EEC    (4)

Finally, the matching score M of patient P_i with a clinical trial is determined by:

    M(P_i, CT) = Σ_{j=1}^{l} r_{i,j}    (5)
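Equations (3) to (5) translate directly into a few lines of NumPy. The sketch below assumes phrase vectors are already grouped per patient; the guard against a zero column span is our own addition, not discussed in the paper.

```python
# NumPy transcription of equations (3)-(5); a sketch assuming each
# patient's phrases and each criterion are already embedded as vectors.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_matrix(patient_phrases, criteria_vecs):
    """Eq. (3): s_ij is the best cosine between P_i's phrases and ec_j."""
    return np.array([[max(cosine(ph, ec) for ph in phrases)
                      for ec in criteria_vecs]
                     for phrases in patient_phrases])

def match_scores(S, is_inclusion):
    """Eqs. (4)-(5): scale each criterion column to [0, n], flip the sign
    for exclusion criteria, and sum over criteria. The zero-span guard is
    our own addition."""
    n = S.shape[0]
    span = S.max(axis=0) - S.min(axis=0)
    R = n * (S - S.min(axis=0)) / np.where(span == 0, 1, span)
    R[:, ~np.asarray(is_inclusion)] *= -1
    return R.sum(axis=1)  # M(P_i, CT) for every patient
```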
IV. EVALUATION

To validate our framework, we used two datasets: MIMIC-III (Medical Information Mart for Intensive Care) [38], comprising information on patients admitted to critical care units, and ClinicalTrials.gov (https://clinicaltrials.gov/), a web-based resource providing access to information on supported clinical studies.

A. Text processing

The MIMIC-III clinical dataset is a critical care database that contains 2,083,108 medical reports from 46,520 patients. We experimented with a randomly selected set of 100 discharge summaries from patients' last visits, excluding patients under 18. The segmentation stage produced an average of 400 phrases per report. We selected a clinical trial that investigates the role of an aldosterone antagonist in patients with heart failure with preserved ejection fraction (NCT04078425). Fig. 3 shows the eligibility criteria of this clinical trial.

Fig. 3. The eligibility criteria specified in the NCT04078425 clinical trial.

B. Evaluation of the obtained results

Table III presents the results for a sample of ten patients. In order to evaluate the clinical correctness of matching patients to the clinical trial (NCT04078425), a validation task was performed manually by a nurse and a computer science student. Notably, the evaluation of the matching did not reveal false positives in the score results. Indeed, the similarity scores reflect the order of matching between patients and the clinical trial. The score distribution ranged from -15 to 8, and the eligible patients retained for further screening by experts were those with a score greater than 5.

TABLE III
RANKS AND SCORES OF MATCHING 10 PATIENTS WITH 6 ELIGIBILITY CRITERIA (NCT04078425)

        iec1    iec2    eec1     eec2     eec3     eec4     Score
P-1     9.46    9.84    -8.47    -2.74    -1.02    -1.02     6.03
P-2     5.02    7.44    -3.43    -3.12    -8.42    -8.42   -10.93
P-3     9.08    8.65   -10.00    -2.38   -10.00   -10.00   -14.65
P-4     0.00    4.09    -2.96    -5.76    -5.24    -5.24   -15.12
P-5     3.43    4.02    -6.26    -2.69    -1.09    -1.09    -3.69
P-6     5.19    0.00     0.00    -1.42     0.00     0.00     3.77
P-7     5.65    2.95    -3.86    -2.72    -0.15    -0.15     1.72
P-8     7.26   10.00    -7.52    -5.76    -6.98    -6.98    -9.99
P-9     6.43    9.14    -4.44   -10.00    -2.70    -2.70    -4.27
P-10   10.00    7.44    -6.27     0.00   -10.00   -10.00    -8.83

We should note that the scores would be more realistic if the segmentation process were more accurate. For instance, the sentence "you were thought to have a blood clot in your right leg" was segmented by MetaMap into "a blood clot in your right leg", which could lead to a false outcome.

V. CONCLUSION

EMRs contain a large portion of unstructured data that needs to be matched with eligibility criteria for trial-patient enrollment. Indeed, the gradual improvement of artificial intelligence technology could reduce the number of physician-hours spent screening patients for eligibility. To tackle this problem, we proposed a framework designed to automatically recommend the most suitable patients for a clinical trial. The framework adopts a pre-trained language model (BioBERT) and uses the STS-B and MedNLI datasets to improve the accuracy of the model via transfer learning. This work verified that fine-tuned BioBERT performs better at calculating the similarity between two medical sentences using embedding-based metrics. In future work, we will also explore the structured tables of EMRs in order to further improve the performance and accuracy of our trial-patient matching framework.

ACKNOWLEDGMENT

The authors would like to thank Marvin Moughabghab for his efforts and contributions to this work.

REFERENCES

[1] H. Dhayne, R. Haque, R. Kilany, and Y. Taher, "In search of big medical data integration solutions - a comprehensive survey," IEEE Access, vol. 7, pp. 91265-91290, 2019.
[2] G. De Moor, M. Sundgren, D. Kalra, A. Schmidt, M. Dugas, B. Claerhout, T. Karakoyun, C. Ohmann, P.-Y. Lastic, N. Ammour et al., "Using electronic health records for clinical research: the case of the EHR4CR project," Journal of Biomedical Informatics, vol. 53, pp. 162-173, 2015.
[3] M. Dugas, M. Lange, C. Müller-Tidow, P. Kirchhof, and H.-U. Prokosch, "Routine data from hospital information systems can support patient recruitment for clinical studies," Clinical Trials, vol. 7, no. 2, pp. 183-189, 2010.
[4] S. T. Rosenbloom, J. C. Denny, H. Xu, N. Lorenzi, W. W. Stead, and K. B. Johnson, "Data from clinical notes: a perspective on the tension between structure and flexible documentation," Journal of the American Medical Informatics Association, vol. 18, no. 2, pp. 181-186, 2011.
[5] H. Dhayne, R. Kilany, R. Haque, and Y. Taher, "SeDIE: A semantic-driven engine for integration of healthcare data," in 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2018, pp. 617-622.
[6] P. Raghavan, J. L. Chen, E. Fosler-Lussier, and A. M. Lai, "How essential are unstructured clinical narratives and information fusion to clinical trial recruitment?" AMIA Summits on Translational Science Proceedings, vol. 2014, p. 218, 2014.
[7] S. W. Tu, M. Peleg, S. Carini, M. Bobak, J. Ross, D. Rubin, and I. Sim, "A practical method for transforming free-text eligibility criteria into computable criteria," Journal of Biomedical Informatics, vol. 44, no. 2, pp. 239-250, 2011.
[8] S. Kripalani, F. LeFevre, C. O. Phillips, M. V. Williams, P. Basaviah, and D. W. Baker, "Deficits in communication and information transfer between hospital-based and primary care physicians: implications for patient safety and continuity of care," JAMA, vol. 297, no. 8, pp. 831-841, 2007.
[9] K. Milian, R. Hoekstra, A. Bucur, A. ten Teije, F. van Harmelen, and J. Paulissen, "Enhancing reuse of structured eligibility criteria and supporting their relaxation," Journal of Biomedical Informatics, vol. 56, pp. 205-219, 2015.
[10] C. Patel, J. Cimino, J. Dolby, A. Fokoue, A. Kalyanpur, A. Kershenbaum, L. Ma, E. Schonberg, and K. Srinivas, "Matching patient records to clinical trials using ontologies," in The Semantic Web. Springer, 2007, pp. 816-829.
[11] T. Kang, S. Zhang, Y. Tang, G. W. Hruby, A. Rusanov, N. Elhadad, and C. Weng, "EliIE: An open-source information extraction system for clinical trial eligibility criteria," Journal of the American Medical Informatics Association, vol. 24, no. 6, pp. 1062-1071, 2017.
[12] C. Yuan, P. B. Ryan, C. Ta, Y. Guo, Z. Li, J. Hardin, R. Makadia, P. Jin, N. Shang, T. Kang et al., "Criteria2Query: a natural language interface to clinical databases for cohort definition," Journal of the American Medical Informatics Association, vol. 26, no. 4, pp. 294-305, 2019.
[13] C. Shivade, C. Hebert, M. Lopetegui, M.-C. De Marneffe, E. Fosler-Lussier, and A. M. Lai, "Textual inference for eligibility criteria resolution in clinical trials," Journal of Biomedical Informatics, vol. 58, pp. S211-S218, 2015.
[14] Y. Ni, S. Kennebeck, J. W. Dexheimer, C. M. McAneney, H. Tang, T. Lingren, Q. Li, H. Zhai, and I. Solti, "Automated clinical trial eligibility prescreening: increasing the efficiency of patient identification for clinical trials in the emergency department," Journal of the American Medical Informatics Association, vol. 22, no. 1, pp. 166-178, 2014.
[15] D. D. Palmer, "Tokenisation and sentence segmentation," Handbook of Natural Language Processing, pp. 11-35, 2000.
[16] A. R. Aronson, "Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program," in Proceedings of the AMIA Symposium. American Medical Informatics Association, 2001, p. 17.
[17] A. R. Aronson and F.-M. Lang, "An overview of MetaMap: historical perspective and recent advances," Journal of the American Medical Informatics Association, vol. 17, no. 3, pp. 229-236, 2010.
[18] G. Papadakis, E. Ioannou, T. Palpanas, C. Niederee, and W. Nejdl, "A blocking framework for entity resolution in highly heterogeneous information spaces," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 12, pp. 2665-2682, 2012.
[19] C. Zhou, C. Sun, Z. Liu, and F. Lau, "A C-LSTM neural network for text classification," arXiv preprint arXiv:1511.08630, 2015.
[20] S. Moen and T. S. S. Ananiadou, "Distributional semantics resources for biomedical text processing."
[21] G. Soğancıoğlu, H. Öztürk, and A. Özgür, "BIOSSES: a semantic sentence similarity estimation system for the biomedical domain," Bioinformatics, vol. 33, no. 14, pp. i49-i58, 2017.
[22] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar et al., "Universal sentence encoder," arXiv preprint arXiv:1803.11175, 2018.
[23] Q. Chen, Y. Peng, and Z. Lu, "BioSentVec: creating sentence embeddings for biomedical texts," arXiv preprint arXiv:1810.09302, 2018.
[24] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep contextualized word representations," arXiv preprint arXiv:1802.05365, 2018.
[25] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," OpenAI, 2018.
[26] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[27] A. Talman and S. Chatzikyriakidis, "Testing the generalization power of neural network models across NLI benchmarks," in Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2019, pp. 85-94.
[28] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized autoregressive pretraining for language understanding," arXiv preprint arXiv:1906.08237, 2019.
[29] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "GLUE: A multi-task benchmark and analysis platform for natural language understanding," arXiv preprint arXiv:1804.07461, 2018.
[30] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, "BioBERT: a pre-trained biomedical language representation model for biomedical text mining," arXiv preprint arXiv:1901.08746, 2019.
[31] H. Xiao, "bert-as-service," https://github.com/hanxiao/bert-as-service, 2018.
[32] S. Ruder, M. E. Peters, S. Swayamdipta, and T. Wolf, "Transfer learning in natural language processing," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, 2019, pp. 15-18.
[33] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using siamese BERT-networks," arXiv preprint arXiv:1908.10084, 2019.
[34] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, "Supervised learning of universal sentence representations from natural language inference data," arXiv preprint arXiv:1705.02364, 2017.
[35] "google-research/bert: TensorFlow code and pre-trained models for BERT," https://github.com/google-research/bert, accessed 09/17/2019.
[36] A. Romanov and C. Shivade, "Lessons from natural language inference in the clinical domain," arXiv preprint arXiv:1808.06752, 2018.
[37] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia, "SemEval-2017 task 1: Semantic textual similarity - multilingual and cross-lingual focused evaluation," arXiv preprint arXiv:1708.00055, 2017.
[38] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark, "MIMIC-III, a freely accessible critical care database," Scientific Data, vol. 3, p. 160035, 2016.