An analogy based framework for patient-stay identification in healthcare

Safa Alsaidi safa.alsaidi@inria.fr Inria Paris

F-75012 Paris France

Centre de Recherche des Cordeliers Inserm Université Paris Cité Sorbonne Université

F-75006 Paris France

Miguel Couceiro miguel.couceiro@loria.fr LORIA CNRS Université de Lorraine

F-54000 France

Esteban Marquer esteban.marquer@loria.fr LORIA CNRS Université de Lorraine

F-54000 France

Sophie Quennelle sophie.quennelle@inria.fr Inria Paris

F-75012 Paris France

Centre de Recherche des Cordeliers Inserm Université Paris Cité Sorbonne Université

F-75006 Paris France

Service d'Informatique Biomédicale Hôpital Necker-Enfants Malades Assistance Publique - Hôpitaux de Paris

F-75015 Paris France

Anita Burgun anita.burgun@aphp.fr Inria Paris

F-75012 Paris France

Centre de Recherche des Cordeliers Inserm Université Paris Cité Sorbonne Université

F-75006 Paris France

Imagine Institute

F-75015 Paris France

Service d'Informatique Biomédicale Hôpital Necker-Enfants Malades Assistance Publique - Hôpitaux de Paris

F-75015 Paris France

Nicolas Garcelon nicolas.garcelon@institutimagine.org Inria Paris

F-75012 Paris France

Centre de Recherche des Cordeliers Inserm Université Paris Cité Sorbonne Université

F-75006 Paris France

Imagine Institute

F-75015 Paris France

Service d'Informatique Biomédicale Hôpital Necker-Enfants Malades Assistance Publique - Hôpitaux de Paris

F-75015 Paris France

Adrien Coulet adrien.coulet@inria.fr Inria Paris

F-75012 Paris France

Centre de Recherche des Cordeliers Inserm Université Paris Cité Sorbonne Université

F-75006 Paris France

ISSN 1613-0073

Keywords: analogy classification, patient matching, electronic health records, patient representation learning

ORCID: 0000-0002-4132-1068 (S. Alsaidi); 0000-0003-2316-7623 (M. Couceiro); 0000-0003-2315-7732 (E. Marquer); 0000-0002-4782-6737 (S. Quennelle); 0000-0001-6855-4366 (A. Burgun); 0000-0002-3326-2811 (N. Garcelon); 0000-0002-1466-062X (A. Coulet)

Analogical proportions are statements of the form "𝐴 is to 𝐵 as 𝐶 is to 𝐷". Analogies have been used in various reasoning and classification tasks across different domains. Representation learning, which addresses the challenge of obtaining a vector representation of complex data, has enabled progress in several analogical reasoning applications. In the biomedical domain, representation learning has been adapted to patient data to solve tasks such as predicting readmission, diagnosis, and length of stay. In this paper, we focus on the particular task of patient-stay identification, i.e., does a hospital stay belong to a given patient or not? This task constitutes a building block for addressing key biomedical tasks such as patient matching and privacy preservation. We propose a prototypical architecture that combines patient-stay representation learning with the analogical reasoning framework. For evaluation, we build sets of analogies from real-world Electronic Health Records, where objects are patient-stay representations learned from the data. We enrich our analogies using analogical properties and use them to train a neural model to detect whether an analogy is valid. We define three initial experimental setups to address our task, present our empirical results, and discuss further perspectives.

Introduction

An analogical proportion, or simply analogy, is a quaternary relation involving four objects 𝐴, 𝐵, 𝐶, and 𝐷 that draws a parallel between the relation between 𝐴 and 𝐵 and the relation between 𝐶 and 𝐷, and that supports analogical reasoning. There are two common tasks associated with analogies, namely, analogy detection and analogy solving. Analogy detection aims at deciding whether a quadruple ⟨𝐴, 𝐵, 𝐶, 𝐷⟩ constitutes a valid analogy. Analogy solving aims at finding an 𝑥 that makes 𝐴 : 𝐵 :: 𝐶 : 𝑥 a valid analogy. Analogy reasoning has been applied to different Natural Language Processing (NLP) tasks such as mining paradigm tables in linguistics and image generation [1,2].

Representation learning consists of learning low-dimension feature representations (i.e., embeddings) from data. These embeddings, or vector representations, of objects (i.e., words, images, characters, etc.) underpin much of modern machine learning and have demonstrated impressive performance on various downstream NLP tasks. For instance, Lim et al. [3] proposed a deep learning model to tackle analogies using semantic embeddings. Their architecture integrates the characteristics of analogies by design and relies heavily on pretrained GloVe embeddings [4]. These embeddings were not trained explicitly to find analogies; yet they were able to detect differences between objects. Hertzmann et al. [5] proposed an analogical framework to learn "image filters" between a pair of images to create an "analogous" filtered result on a third image. The generated image 𝐷 should relate to 𝐶 in the same way as 𝐵 relates to 𝐴. Alsaidi et al. [6] developed a neural approach and used character-based embeddings to detect morphological analogies between words.

Analogies have not been sufficiently exploited in healthcare, which motivates our work. Yet practitioners unconsciously use analogical reasoning (i.e., medical reasoning) in their daily clinical practice to understand the possible causes for a disease diagnosis and prognosis by linking visible signs and symptoms observed among different patients. In addition, several machine learning methods have been applied to investigate analogies in healthcare. For instance, Casteleiro et al. [7] used analogies to infer disease treatments from statements extracted from text. In their work, they extract biomedical facts from embeddings by analogical reasoning. Dynomant et al. [8] used analogical proportions to compare embedding methods trained on a corpus of French health-related documents (e.g., discharge summaries, procedure reports, and prescriptions). Analogical proportions were applied to the embeddings of medical documents to verify whether $(\vec{A} - \vec{B}) + \vec{C} \approx \vec{D}$, thus making it possible to check whether the similarity between 𝐴 and 𝐵 is similar to the one between 𝐶 and 𝐷. An example of an analogical proportion they obtain is "(cardiology − heart) + lung ≈ pneumology". Rather et al. [9] used analogical proportions to identify hidden or unknown biomedical knowledge from free-text resources. They proposed analogical proportions of the form "acetaminophen is a type of drug as diabetes is a type of disease".
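The vector test $(\vec{A} - \vec{B}) + \vec{C} \approx \vec{D}$ described above can be illustrated with a minimal sketch. The three-dimensional vectors below are toy values chosen for illustration only (they are not embeddings from [8]), and `cosine` and `solve_analogy` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(a, b, c, candidates):
    """Return the candidate whose vector is closest to (a - b) + c."""
    target = [x - y + z for x, y, z in zip(a, b, c)]
    return max(candidates, key=lambda name: cosine(target, candidates[name]))

# Toy vectors (assumption: invented values, for illustration only).
cardiology = [1.0, 1.0, 0.0]
heart = [1.0, 0.0, 0.0]
lung = [0.0, 0.0, 1.0]
candidates = {
    "pneumology": [0.0, 1.0, 1.0],
    "insurance": [1.0, 0.0, -1.0],
    "heart": heart,
}

best = solve_analogy(cardiology, heart, lung, candidates)
```

With these toy values, `(cardiology - heart) + lung` coincides with the `pneumology` vector, so that candidate is selected.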

In this paper, we aim to explore how the analogy framework can help in solving tasks relevant to the healthcare domain. We propose two models that learn patient-stay representations (i.e., learn a vector representation of all the patient EHR data collected during a single stay) to detect analogies in healthcare. To do so, we define two crucial steps that are (1) the learning of embeddings adapted to patient data, and (2) the definition of a neural network dedicated to learn formal properties of analogy. As for the network, we use the same model that was proposed by Lim et al. [3] for word semantics, and later adapted by Alsaidi et al. [6] by incorporating character-based embeddings for morphological analogies. We argue that the framework itself has the potential to be applied in a wide range of domains, and we propose to use it here for healthcare applications, namely, for the patient identification task we introduce below.

Electronic Health Records (EHRs) are real-world healthcare data that have been used to train predictive models (including neural network models) for different biomedical tasks, e.g., predicting patient mortality, hospital readmission, or length of stay. EHRs consist of clinical and administrative data collected during patient hospital stays in the form of both structured and unstructured data. Structured data generally includes diagnostic codes, lab tests, demographics, admission-related information, etc. It can be either static, e.g., patient demographics, or temporal, e.g., vital signs. Unstructured data includes various documents in natural language such as clinical notes, nursing reports, discharge summaries, lab reports, etc. For this work, we consider EHRs from the MIMIC-III (Medical Information Mart for Intensive Care, version 3) database [10] to learn patient representations (i.e., patient embeddings) by converting patient data from the raw EHRs to embeddings that can be further processed. MIMIC-III is a free, publicly available hospital database containing de-identified patient health data. This database has been widely used by researchers conducting data mining and machine learning studies applied to healthcare.

Several neural network architectures have been developed to represent biomedical data. For instance, Si et al. [11] adapted a multi-level CNN to learn patient representations from clinical notes through a multi-task learning framework to predict patient mortality and length of stay. Zhang et al. [12] used a GRU-based RNN to capture relationships between clinical events and employed an attention mechanism to learn a personalized representation to predict a patient's future hospitalization using EHR data. Madhumita et al. [13] used a stacked denoising autoencoder and a paragraph vector model to learn generalized patient representations directly from clinical notes to predict patient mortality, primary diagnostic, procedural category, and patient gender. Zhang et al. [14] proposed two neural network architectures that enhance patient representation learning by combining sequential unstructured notes with structured data and evaluated these representations on three risk evaluation tasks (i.e., in-hospital mortality, 30-day hospital readmission, and length of stay prediction). In our paper, we learn patient-stay representations and consider the task of patient-stay identification. We think that tools addressing this task will serve as building blocks for more complex and key biomedical tasks, such as patient matching and privacy preservation checking [15,16].

In this paper, we particularly propose to tackle this task by relying on the detection of analogies in healthcare. In Section 2, we define the setting of analogy that we work on. The models we propose to detect analogies are described in Section 3, along with the procedures we use for data augmentation, training, and evaluation. In Section 4, we provide a description of the MIMIC-III dataset and detail how we build our experimental dataset. We present our experiments and report our results in Section 5. In Section 6, we discuss perspectives for future research.

The main contributions of this paper are the following:

• we propose an analogy based setting using patient-stay representations;

• we propose an embedding model to learn patient-stay representations;

• we display the performance of our classification model to detect analogies on patient-stay data.

Defining the task

As defined previously, an analogy is a 4-ary relation written as 𝐴 : 𝐵 :: 𝐶 : 𝐷 and read as "𝐴 is to 𝐵 as 𝐶 is to 𝐷". In this paper, we work on patient-stay analogies, i.e., on analogies involving hospital stays. In our setting, 𝐴, 𝐵, 𝐶, and 𝐷 are patient-stay representations.

We define an analogy based setting on patient-stay data that we refer to as Identity setting.

For that, we consider patient-stay representations, which are vector representations of EHR data that belong to a single hospital stay. Based on the type of EHR data that we decide to include, our patient-stay representations can be made of a representation of either structured or unstructured data, or they can be made of the concatenation of both types of data. More details are provided in Section 5. For this setting, we propose to build analogies of the form:

$s^{i_1}_{t_1} : s^{i_1}_{t_2} :: s^{i_2}_{t_3} : s^{i_2}_{t_4}$

where $s^{i}_{t}$ refers to the stay $t$ of patient $i$. Here, each pair of the analogy quadruple is made of two random stays belonging to the same patient. Since there is no constraint on the order of stays, $s^{i_1}_{t_1}$ can happen before or after $s^{i_1}_{t_2}$. Note that $i_1$ and $i_2$ can be the same patient, and that $t_1$ and $t_2$, or $t_3$ and $t_4$, can represent the same time stamp. Furthermore, $t_1$ and $t_3$, or $t_2$ and $t_4$, can be the same when $i_1 = i_2$ (but not when $i_1 \neq i_2$). The Identity setting finds applications in several tasks relevant to biomedical informatics, including:

• data cleaning;

• data privacy related applications;

• patient matching.

Data cleaning applications in the health domain involve repairing or removing patient health data that is inaccurate, incorrectly structured, duplicated, or incomplete. In such applications, we can reassign an erroneously attributed sample of data to the patient it belongs to. Privacy related applications include verifying whether patient data is de-identified, and whether it can be re-identified using different systems. Patient matching is defined as the identification and linking of one patient's data within and across health databases in order to obtain a comprehensive view of that patient's health care record [17]. In patient matching, we try to match patient-related information, either a single piece of patient data (e.g., a document) or full EHR data, that can coexist in one or several databases.

In this paper, we try to match patient-stay representations to the patient they belong to. We focus on the task of patient-stay identification, where we aim to determine whether a particular hospital stay belongs to a certain patient. We address this task by learning a model that classifies quadruples of stays into valid and invalid analogies. In this sense, we implement the task of analogy detection, which aims to determine whether a quadruple is a valid analogy. For our Identity setting, we define a valid analogy as a quadruple of four stays $(s^{i_1}_{t_1}, s^{i_1}_{t_2}, s^{i_2}_{t_3}, s^{i_2}_{t_4})$, where each pair of stays belongs to a single patient $i_j$; other forms of analogies are considered invalid.
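The validity criterion of the Identity setting can be written as a small predicate. This is a minimal sketch in which a stay is reduced to a (patient, stay) identifier pair; the `Stay` type and the function name are hypothetical and only serve the illustration, whereas the actual model works on learned representations rather than on identifiers.

```python
from typing import NamedTuple

class Stay(NamedTuple):
    """A hospital stay reduced to its identifiers (hypothetical type)."""
    patient_id: str
    stay_id: str

def is_valid_identity_analogy(a: Stay, b: Stay, c: Stay, d: Stay) -> bool:
    """A : B :: C : D is valid when A and B share a patient, and C and D share a patient."""
    return a.patient_id == b.patient_id and c.patient_id == d.patient_id

a, b = Stay("p1", "t1"), Stay("p1", "t2")
c, d = Stay("p2", "t3"), Stay("p2", "t4")

valid = is_valid_identity_analogy(a, b, c, d)    # both pairs are single-patient
invalid = is_valid_identity_analogy(a, c, b, d)  # each pair mixes two patients
```

The ground-truth labels used to train a detector in this setting can be derived in this way from the patient identifiers attached to each stay.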

Proposed Approach

Our model is made of two components: an embedding model and a classification model. The second takes as input patient-stay representations computed by the first (see Section 3.1).

ICCBR'22 Workshop Proceedings

Our embedding model is trained along with the classification model. We also detail the data augmentation procedure in Section 3.2, and describe the training and evaluation protocols that we followed in Section 3.3.

Embedding and Classification Models

The models described in this subsection are schematized in Figure 1.

Classification Model. As in Alsaidi et al. [6], we adapt the neural architecture of Lim et al. [3] to our patient-stay setting. Our classification model determines whether an analogy 𝐴 : 𝐵 :: 𝐶 : 𝐷 is valid by verifying whether 𝐴 and 𝐵 differ in the same way as 𝐶 and 𝐷. The architecture of the classification model is a Convolutional Neural Network (CNN), which takes as input the embeddings of size 𝑛 of the four elements 𝐴, 𝐵, 𝐶, and 𝐷. We stack them to get a matrix of size 𝑛 × 4. The CNN is made of three layers, as depicted in the right frame of Figure 1. The first convolutional layer, with 128 filters of size 1 × 2, is applied to the embeddings such that it analyses each pair separately, without overlap, and measures how 𝐴 and 𝐵, and how 𝐶 and 𝐷, differ for each component. The second convolutional layer, with 64 filters of size 2 × 2, is applied to the resulting matrix and aims at checking whether the difference between 𝐴 and 𝐵 is the same as the one between 𝐶 and 𝐷. If 𝐴 and 𝐵 differ in the same way as 𝐶 and 𝐷, then 𝐴 : 𝐵 :: 𝐶 : 𝐷 is a valid analogy. The result is flattened into a one-dimensional vector of size 64 × (𝑛 − 1) and used as input to a fully connected dense layer that produces a single output. This last layer aggregates the information using a sigmoid activation to get a result (i.e., the output of the classification model) between 0 (for invalid analogies) and 1 (for valid analogies). All layers except the last one use the Rectified Linear Unit (ReLU) as activation function.
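The layer sizes given above can be checked with a little shape arithmetic. The sketch below is not the authors' implementation: it only traces tensor shapes, and it assumes unpadded convolutions, a horizontal stride of 2 for the first layer (our reading of "without overlap"), and a stride of 1 for the second layer. The function names are hypothetical.

```python
def conv_output(height, width, kernel, stride):
    """Output height and width of an unpadded 2D convolution."""
    kernel_h, kernel_w = kernel
    stride_h, stride_w = stride
    return (height - kernel_h) // stride_h + 1, (width - kernel_w) // stride_w + 1

def classifier_shapes(n):
    """Trace the shapes of the classification CNN for embeddings of size n."""
    # Input: the four embeddings stacked into an n x 4 matrix.
    h1, w1 = conv_output(n, 4, kernel=(1, 2), stride=(1, 2))    # one column per pair
    h2, w2 = conv_output(h1, w1, kernel=(2, 2), stride=(1, 1))  # compares the two pairs
    return {
        "conv1": (128, h1, w1),    # 128 filters of size 1 x 2
        "conv2": (64, h2, w2),     # 64 filters of size 2 x 2
        "flatten": 64 * h2 * w2,   # input size of the dense layer
    }

shapes = classifier_shapes(80)
```

Under these assumptions the first layer yields an 𝑛 × 2 map per filter (one column for the pair 𝐴, 𝐵 and one for the pair 𝐶, 𝐷), and the second layer reduces it to (𝑛 − 1) × 1, which gives the 64 × (𝑛 − 1) flattened vector mentioned in the text.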

Data Augmentation

Deep neural network approaches require large amounts of data. We therefore took advantage of the properties of analogies to produce additional proportions from our dataset, a process called data augmentation. Previous works [19,20,21] have proposed postulates that analogies should obey. For this study, we consider five of them: reflexivity, inner reflexivity, symmetry, inner symmetry, and central permutation. Based on the definition of our analogical setting, we can apply all of these postulates to generate valid analogical proportions except for central permutation, which can only be applied in the very particular case where $i_1 = i_2$. When $i_1 \neq i_2$, central permutation cannot be used to enlarge our dataset, as it would associate stays of distinct patients, which is inconsistent with the aim of the Identity setting. Note that from reflexivity and central permutation we can deduce inner reflexivity. As reflexivity forces $i_1 = i_2$, applying it in cases where $i_1 \neq i_2$ would result in a case where $i_1 = i_2$.

For the cases where $i_1 \neq i_2$, given a valid analogy 𝐴 : 𝐵 :: 𝐶 : 𝐷, we can generate eight additional valid analogical proportions and two invalid analogical proportions by rearranging its four stays. For cases where $i_1 = i_2$, we apply reflexivity to generate one more valid analogical proportion, namely 𝐴 : 𝐵 :: 𝐴 : 𝐵. Note that for cases where $i_1 = i_2$, the proportions that are invalid when $i_1 \neq i_2$ are considered valid, since all four stays belong to the same patient.
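The augmentation for the case $i_1 \neq i_2$ can be sketched as follows, using the eight valid and two invalid forms listed in the paper. The helper `augment` and the encoding of each form as a four-letter string are illustrative choices, not the authors' code.

```python
# Forms listed in the paper for i1 != i2, written as the order of A, B, C, D.
VALID_FORMS = ("CDAB", "DCBA", "BADC", "AACC", "BACD", "ABDC", "CDBA", "DCAB")
INVALID_FORMS = ("DABC", "ACBD")

def augment(a, b, c, d):
    """Return (positives, negatives) generated from a valid analogy A : B :: C : D."""
    items = {"A": a, "B": b, "C": c, "D": d}

    def build(form):
        return tuple(items[letter] for letter in form)

    positives = [(a, b, c, d)] + [build(form) for form in VALID_FORMS]
    negatives = [build(form) for form in INVALID_FORMS]
    return positives, negatives

# Stays are represented as (patient, stay) identifiers for the illustration.
a, b = ("p1", "t1"), ("p1", "t2")
c, d = ("p2", "t3"), ("p2", "t4")
positives, negatives = augment(a, b, c, d)

def pairs_share_patient(quadruple):
    """Check the Identity criterion on a quadruple of (patient, stay) tuples."""
    w, x, y, z = quadruple
    return w[0] == x[0] and y[0] == z[0]
```

Each positive keeps stays of the same patient inside each pair, whereas each negative mixes stays of the two patients within a pair; this is why the two invalid forms become valid when all four stays belong to one patient.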

Training and Evaluation

As mentioned, we define a valid analogy as a quadruple of four stays $(s^{i_1}_{t_1}, s^{i_1}_{t_2}, s^{i_2}_{t_3}, s^{i_2}_{t_4})$, where each pair of stays belongs to a single patient $i_j$. For each analogy in the dataset, we start by embedding the four stays. We augment the embeddings using the postulates recalled in Section 3.2. As a result, we generate 9 valid analogical proportions (i.e., positive examples) and 2 invalid analogical proportions for cases where $i_1 \neq i_2$. For cases where $i_1 = i_2$, we obtain 10 + 2 = 12 valid analogical proportions and no invalid analogical proportions. For optimization, we use the Binary Cross-Entropy (BCE) loss. To evaluate the classification model, we use the same data augmentation process as for training, and we compute the accuracy and the F1 score.
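The loss and the metrics named above can be written out explicitly. This is a plain-Python sketch for illustration (the released code relies on PyTorch); it assumes that valid analogies are the positive class for the F1 score, and the function names are hypothetical.

```python
import math

def bce_loss(probabilities, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def accuracy_and_f1(predictions, labels):
    """Accuracy and F1 score, taking label 1 (valid analogy) as the positive class."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return correct / len(labels), f1

# One augmented group for i1 != i2: 9 valid and 2 invalid proportions.
labels = [1] * 9 + [0] * 2
always_valid = [1] * len(labels)
accuracy, f1 = accuracy_and_f1(always_valid, labels)
```

On such a group, a trivial classifier that always answers "valid" already reaches an accuracy of 9/11 and an F1 score of 0.9, which illustrates why accuracy is best read separately for valid and invalid analogies when the classes are imbalanced.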

Dataset description

For our experiments, we used EHRs from MIMIC-III [10] as a source of patient medical history data. MIMIC-III is a critical care database developed by the Laboratory for Computational Physiology of the Massachusetts Institute of Technology (MIT) and distributed by PhysioNet [22]. The database is publicly available: researchers can access it after completing a HIPAA training course required by the National Institutes of Health (NIH). The database contains health-related information associated with all patients admitted to the Intensive Care Unit (ICU) of Beth Israel Deaconess Medical Center between 2001 and 2012. It encompasses data on more than 40,000 ICU patients with more than 60,000 ICU stays. All patient data has been de-identified in accordance with the Health Insurance Portability and Accountability Act (HIPAA). The dataset contains various types of data such as patient demographics, vital signs, lab test results, medications, hospital length of stay, procedures, clinical notes, diagnosis codes (ICD-9), imaging reports, etc.

To build our dataset, we keep only adult patients (i.e., patients aged 18 and above) with at least two admissions. As we do not define any order constraint, we obtain all the permutations of the stays belonging to a patient. We organize our dataset so that each pair of stays is associated with the patient it belongs to: ⟨𝑆1, 𝑆2, 𝑃𝐴𝑇𝐼𝐸𝑁𝑇_𝐼𝐷⟩, where 𝑆1 corresponds to $s^{i_1}_{t_1}$, 𝑆2 corresponds to $s^{i_1}_{t_2}$, and 𝑃𝐴𝑇𝐼𝐸𝑁𝑇_𝐼𝐷 represents $i_1$. We obtain a dataset made of 46,986 triples, where every two pairs of stays produce an analogy. For our experiments, we use all hospital stays associated with 200 randomly selected patients. We use the data augmentation process to generate positive and negative examples. For training and evaluation, we perform a random split (using a fixed random seed) into a training set containing 70% of the extracted analogies, the remaining 30% serving as the test set. We end up with 939,638 analogies for training and 402,703 for testing. To keep training and evaluation time reasonable, we randomly selected 50,000 analogies from the training set and 50,000 analogies from the test set.
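The construction of triples and analogies, followed by the 70/30 split, can be sketched as below. This is an illustration under assumptions: stays are plain identifiers, the seed value 42 is arbitrary, and pairing each triple with every triple including itself is our reading of "every two pairs of stays produce an analogy". That reading is consistent with the reported totals, since 1,638 triples would give 1,638 × 1,639 / 2 = 1,342,341 = 939,638 + 402,703 analogies, but the text does not state it explicitly.

```python
import itertools
import random

def build_triples(stays_by_patient):
    """Return (S1, S2, PATIENT_ID) for every ordered pair of distinct stays of a patient."""
    triples = []
    for patient_id, stays in stays_by_patient.items():
        for s1, s2 in itertools.permutations(stays, 2):
            triples.append((s1, s2, patient_id))
    return triples

def build_analogies(triples):
    """Combine every two triples (a triple may pair with itself) into a quadruple."""
    pairs = itertools.combinations_with_replacement(triples, 2)
    return [(a1, a2, b1, b2) for (a1, a2, _), (b1, b2, _) in pairs]

def split_train_test(analogies, seed=42, train_percent=70):
    """Shuffle with a fixed seed, then cut into train and test parts."""
    shuffled = list(analogies)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]

# Toy cohort (hypothetical identifiers): two patients with 2 and 3 stays.
stays_by_patient = {"p1": ["p1_s1", "p1_s2"], "p2": ["p2_s1", "p2_s2", "p2_s3"]}
triples = build_triples(stays_by_patient)
analogies = build_analogies(triples)
train, test = split_train_test(analogies)
```

Because the split is drawn over analogies rather than over patients, stays of the same patient can appear on both sides of the split in this sketch.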

Experiment Setup

We now present the three experiments that we conducted in the Identity setting. In Section 5.1, we describe the patient-stay features that we consider and the data preprocessing that we performed for structured and unstructured data. We describe the implementation details in Section 5.2. The results of our experiments are reported in Section 5.3 and discussed further in this section. The code used for our experiments is written in Python 3.9 and PyTorch and is available in the repository https://github.com/Safa-98/patient-stay-analogy.

Stay Features and Data Preprocessing

We consider both structured (i.e., demographics and admission-related information) and unstructured data (i.e., clinical notes) to define our analogies. In this subsection, we describe the patient-stay features that are utilized by our model and some data preprocessing details.

Static information.

In our experiments, static information consists of demographic information and admission-related information. For demographic information, we extract the patient's age, gender, marital status, ethnicity, and insurance information. We keep only adult patients (i.e., patients aged 18 and above). We split the age into 5 groups: [18, 25[, [25, 45[, [45, 65[, [65, 89[, and [89, +∞[. For admission-related information, we include the admission type as a feature.
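The age grouping can be expressed with a sorted list of bin boundaries. The sketch below is illustrative: the label strings and the function name are hypothetical, and only the five intervals come from the text.

```python
import bisect

# Upper boundaries of the first four groups; each interval is closed on the left.
AGE_BOUNDS = [25, 45, 65, 89]
AGE_LABELS = ["[18,25[", "[25,45[", "[45,65[", "[65,89[", "[89,+inf["]

def age_group(age):
    """Map an adult patient's age to one of the five groups."""
    if age < 18:
        raise ValueError("only adult patients (aged 18 and above) are kept")
    return AGE_LABELS[bisect.bisect_right(AGE_BOUNDS, age)]

groups = [age_group(age) for age in (18, 25, 64, 88, 89, 300)]
```

Using `bisect_right` places a boundary value such as 25 in the group that starts at 25, which matches the half-open intervals above.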

Clinical notes. Nursing, Nursing/Other, Physician, and Radiology notes make up the majority of clinical notes in MIMIC-III database. For each hospital stay, we only kept notes that belong to these 4 categories. We excluded notes that have an error tag and notes that lack a hospital admission id.
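The note selection rules can be sketched as a row filter. The sketch assumes notes are available as dictionaries keyed by the `CATEGORY`, `ISERROR`, and `HADM_ID` columns of the MIMIC-III `NOTEEVENTS` table; the normalisation of category strings (trimming and lower-casing) and the helper name `keep_note` are assumptions made for the illustration.

```python
KEPT_CATEGORIES = {"nursing", "nursing/other", "physician", "radiology"}

def keep_note(row):
    """Keep a note if it is in one of the 4 categories, has no error tag, and has an admission id."""
    category = (row.get("CATEGORY") or "").strip().lower()
    has_error = str(row.get("ISERROR") or "").strip() == "1"
    has_admission = row.get("HADM_ID") not in (None, "")
    return category in KEPT_CATEGORIES and not has_error and has_admission

# Toy rows with invented values, for illustration only.
notes = [
    {"CATEGORY": "Nursing", "ISERROR": None, "HADM_ID": 100},
    {"CATEGORY": "Radiology", "ISERROR": "1", "HADM_ID": 100},
    {"CATEGORY": "Physician ", "ISERROR": None, "HADM_ID": None},
    {"CATEGORY": "Discharge summary", "ISERROR": None, "HADM_ID": 101},
    {"CATEGORY": "Nursing/other", "ISERROR": None, "HADM_ID": 101},
]
kept = [note for note in notes if keep_note(note)]
```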

Implementation Details

To build our cohorts, we performed the preprocessing described in the previous section to obtain our patient-stay features. Patients without any clinical notes, or with notes that do not belong to the 4 categories defined above, were removed. We computed the median number of notes per hospital admission to determine the number of clinical notes to extract. We fine-tune the unsupervised Doc2Vec model [18] on the training set to obtain document-level embeddings using the Gensim toolkit [23]. For the training algorithm, we use PV-DBOW (Paragraph Vector - Distributed Bag of Words). We set the number of training epochs to 30, the initial learning rate to 0.025, the learning rate decay to 0.0002, and the dimension of the vectors to 200. The Fusion CNN model is trained with the Adam optimizer with a learning rate of 0.0001 and ReLU as the activation function. The chosen batch size is 64.

In this paper we perform three experiments. In the first, we consider both structured and unstructured data. Therefore, we obtain our patient-stay representation by concatenating the representations of clinical notes along with static information. In this experiment, we verify if a particular hospital stay belongs to a patient by looking at both the structured and unstructured data associated with each stay. In the second, we only consider unstructured data, which means that our patient-stay representations are based solely on the representations of clinical notes. Therefore, by looking at clinical notes associated with a single hospital stay, we check if a particular hospital stay belongs to a patient. In the third, we only consider structured data, which means that our patient-stay representations are based solely on the representations of static information (i.e., demographics and admission-related information). Therefore, we verify if a particular hospital stay belongs to a patient by looking at the static information that is associated with a hospital stay.
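The three experimental configurations differ only in which parts of the representation are kept. A minimal sketch, assuming the static and note representations (written 𝑍 static and 𝑍 note in the results) are available as flat lists of floats; the function and the mode names are hypothetical.

```python
def patient_stay_representation(z_static, z_note, mode):
    """Assemble a patient-stay representation for one of the three experiments."""
    if mode == "both":    # first experiment: concatenation [Z_static ; Z_note]
        return list(z_static) + list(z_note)
    if mode == "notes":   # second experiment: clinical notes only
        return list(z_note)
    if mode == "static":  # third experiment: static information only
        return list(z_static)
    raise ValueError(f"unknown mode: {mode}")

# Toy vectors with invented values, for illustration only.
z_static = [0.1, 0.2]
z_note = [0.3, 0.4, 0.5]
both = patient_stay_representation(z_static, z_note, "both")
```

Whatever the configuration, the four representations of a quadruple must share the same size 𝑛 so that they can be stacked into the 𝑛 × 4 input of the classification model.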

Results and Discussion

As mentioned previously, we conducted three experiments that mainly differ in what type of data was used to obtain our patient-stay representations. For all the experiments, we used 50,000 analogies for training and evaluation, and applied the same procedure for data augmentation. We report the accuracy and F1 score for each experiment. The F1 score gives a better measure of the incorrectly classified cases than the accuracy metric.

For the first experiment, we fed our embedding model with both structured (i.e., demographics and admission-related information) and unstructured data (i.e., clinical notes). Our patient-stay representations are thus made of the concatenation of static information and clinical notes. We chose the epochs where the training loss is at the local minimum. We trained our model for 10, 20, and 40 epochs, with 3 different random initializations in each case. Our results are detailed in Table 1. Our model performs the best for positive examples. For 40 epochs, the model gives the best result for valid analogies and performs best for invalid analogies for 20 epochs.

To gain more insight into how our models perform, we conducted an error analysis where we noticed that most misclassifications were spotted in two cases.

1. Cases where $i_1 = i_2$.

To recall, we do not generate invalid analogies for cases where $i_1 = i_2$; therefore, the invalid analogy forms (𝐷 : 𝐴 :: 𝐵 : 𝐶 and 𝐴 : 𝐶 :: 𝐵 : 𝐷) should be considered valid in these cases. In our error analysis, we noticed that when the four stays belong to the same patient, our model classifies these forms as invalid instead of valid. We believe that our model was not trained enough to distinguish these forms of analogies, as fewer analogies with four stays belonging to the same patient were generated in our dataset.

2. Cases where representations are made of only clinical notes.

To recall, in our second experiment we only used the representations of clinical notes to obtain patient-stay representations. We noticed that when the categories of the clinical notes are similar between two hospital stays, or when two hospital stays have fewer than five clinical notes, our model struggles to distinguish between the two hospital stays. This indicates that, in some cases, using only clinical notes to learn patient-stay representations might not be sufficient, as these notes might not contain enough information to help our model differentiate between two similar stays that belong to two distinct patients. As a result, the model would incorrectly match these two similar stays to the same patient.

In these experiments, we did not include temporal data: we only used demographics and admission-related information as structured data. It would be interesting to also include temporal signals (e.g., vital signs) along with demographics and admission-related information. Our patient-stay representations would then be made of the concatenation of the representations of static information and temporal signals as structured data, and of the representation of clinical notes as unstructured data.

Conclusion and Perspectives

We adapted the approach in [3,6] from semantic and morphological analogies to patient-stay analogies. Our prototypical architecture has some limits, but seems promising for the task of patient identification. Our classification model is flexible in terms of the analogies that it classifies. Changing the way the data is augmented will change the way the model behaves. Our model can be adapted to different healthcare applications through dedicated embedding models [24]. Inspired by [14], we implemented a model to build patient-stay representations. As mentioned in Section 5.3, there are multiple plausible improvements to our approach, in terms of balancing valid and invalid analogies as well as including other types of data to build our patient-stay representations. As we limited ourselves to analogy detection, a future work would be to address analogy solving in the same setting that would allow the generation of synthetic patient-stays.

Figure 1: The Fusion CNN embedding model and the CNN classification model.

Postulates of analogy considered for data augmentation (Section 3.2):

• 𝐴 : 𝐵 :: 𝐴 : 𝐵 (reflexivity);

• 𝐴 : 𝐴 :: 𝐶 : 𝐶 (inner reflexivity);

• 𝐴 : 𝐵 :: 𝐶 : 𝐷 → 𝐶 : 𝐷 :: 𝐴 : 𝐵 (symmetry);

• 𝐴 : 𝐵 :: 𝐶 : 𝐷 → 𝐵 : 𝐴 :: 𝐷 : 𝐶 (inner symmetry);

• 𝐴 : 𝐵 :: 𝐶 : 𝐷 → 𝐴 : 𝐶 :: 𝐵 : 𝐷 (central permutation).

Analogical proportions generated from a valid analogy 𝐴 : 𝐵 :: 𝐶 : 𝐷 when $i_1 \neq i_2$ (Section 3.2), namely eight valid analogical proportions:

• 𝐶 : 𝐷 :: 𝐴 : 𝐵,

• 𝐷 : 𝐶 :: 𝐵 : 𝐴,

• 𝐵 : 𝐴 :: 𝐷 : 𝐶,

• 𝐴 : 𝐴 :: 𝐶 : 𝐶,

• 𝐵 : 𝐴 :: 𝐶 : 𝐷,

• 𝐴 : 𝐵 :: 𝐷 : 𝐶,

• 𝐶 : 𝐷 :: 𝐵 : 𝐴,

• 𝐷 : 𝐶 :: 𝐴 : 𝐵;

and two invalid analogical proportions, namely

• 𝐷 : 𝐴 :: 𝐵 : 𝐶 and

• 𝐴 : 𝐶 :: 𝐵 : 𝐷.

Table 1: Accuracy and F1 score (both in %) of 3 runs of the classification model. Embeddings used are the concatenation of static information and clinical notes.

Epochs       Valid           Invalid         F1
40 epochs    98.41 ± 1.56    68.22 ± 1.94    95.79 ± 0.59
20 epochs    94.89 ± 1.74    72.08 ± 1.68    94.30 ± 0.80
10 epochs    96.85 ± 1.75    70.31 ± 1.94    95.20 ± 0.71

Based on the median number of notes per hospital admission, we kept the first 12 notes, and used padding (i.e., completion with zeros) for hospital admissions with fewer than 12 notes.
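Keeping the first 12 notes and padding with zeros can be sketched as follows. The sketch assumes that each note is already a 200-dimensional Doc2Vec vector (the dimension given in the implementation details) and that padding is done with all-zero vectors; `pad_notes` is a hypothetical helper.

```python
def pad_notes(note_vectors, max_notes=12, dim=200):
    """Keep the first max_notes note vectors and pad with zero vectors if there are fewer."""
    kept = [list(vector) for vector in note_vectors[:max_notes]]
    missing = max_notes - len(kept)
    kept.extend([0.0] * dim for _ in range(missing))
    return kept

# A stay with 3 notes is padded to 12; a stay with 15 notes is truncated to 12.
short_stay = pad_notes([[1.0] * 200 for _ in range(3)])
long_stay = pad_notes([[1.0] * 200 for _ in range(15)])
```

Every stay is thus represented by the same number of note vectors, which gives the embedding model a fixed-size input.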

Acknowledgments

Experiments presented in this paper were carried out using computational clusters equipped with GPU from the Grid'5000 testbed (see https://www.grid5000.fr).

The research work of the second and third named authors is partially supported by TAILOR, a project funded by EU Horizon 2020 research and innovation program under GA No 952215, and the Inria Project Lab "Hybrid Approaches for Interpretable AI" (HyAIAI).

For our second experiment, we used only unstructured data, i.e., the $Z_{note}$ part of the embedding, for the patient-stay representations. Our patient-stay representations thus consisted only of the representation of clinical notes. The training loss was at a local minimum for 15, 20, and 40 epochs. Therefore, we trained our model for 15, 20, and 40 epochs, with 3 different random initializations in each case. As shown in Table 2, our model performs best for positive examples when trained for 20 epochs.

For our third experiment, we used only structured data, i.e., $Z_{static}$, for the patient-stay representations. The training loss was at a local minimum for 15, 20, and 40 epochs. Therefore, we trained our model for 15, 20, and 40 epochs, with 3 different random initializations in each case. We report our results in Table 3. The accuracy for positive examples is high in all cases, whereas it drops for negative examples.

In all our experiments, our model performs best for positive examples, regardless of whether we use $[Z_{static}; Z_{note}]$, only $Z_{note}$, or only $Z_{static}$ for the patient-stay representations. This can be explained by the imbalance between positive and negative examples in the training data. Balancing the data would be the next step, as it proved to be a good solution in [6] to get similar results for positive and negative examples. The accuracy for valid analogies is the highest when our embedding model is fed with only static information. Between the first and the second experiment, the accuracy for valid analogies is higher when the patient-stay representations are made of the concatenation than when they are made of only clinical notes. This indicates that using static information when learning patient-stay representations, as in the first and third experiments, improves the performance of our model: it allows the model to better distinguish the stays and to match them to the patient they belong to. We also notice that the accuracy for invalid analogies is the highest when the embedding model is fed with only clinical notes. For all performed experiments, the F1 score is high, which indicates that our model is able to correctly assign analogies to the class they belong to (i.e., valid or invalid).

[1] R. Fam, Y. Lepage. Morphological predictability of unseen words using computational analogy. Workshops Proceedings for the Twenty-fourth International Conference on Case-Based Reasoning (ICCBR), vol. 1815, 2016.
[2] S. E. Reed, Y. Zhang, Y. Zhang, H. Lee. Deep visual analogy-making. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2015.
[3] S. Lim, H. Prade, G. Richard. Solving word analogies: A machine learning perspective. Proceedings of the Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU), vol. 11726, 2019.
[4] J. Pennington, R. Socher, C. D. Manning. Glove: Global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), 2014.
[5] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, D. Salesin. Image analogies. Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 2001.
[6] S. Alsaidi, A. Decker, P. Lay, E. Marquer, P.-A. Murena, M. Couceiro. A neural approach for detecting morphological analogies. Proceedings of the 8th IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2021.
[7] M. A. Casteleiro, J. D. Diz, N. Maroto, M. J. F. Prieto, S. Peters, C. Wroe, C. S. Torrado, D. M. Fernandez, R. Stevens. Semantic deep learning: Prior knowledge and a type of four-term embedding analogy to acquire treatments for well-known diseases. JMIR Medical Informatics 8, 2020.
[8] E. Dynomant, R. Lelong, B. Dahamna, C. Massonnaud, G. Kerdelhué, J. Grosjean, S. Canu, S. J. Darmoni. Word embedding for the french natural language in health care: comparative study. JMIR Medical Informatics 7, 2019.
[9] N. N. Rather, C. Patel, S. A. Khan. Using deep learning towards biomedical knowledge discovery. International Journal of Mathematical Sciences and Computing (IJMSC) 3, 2017.
[10] A. E. W. Johnson, T. J. Pollard, L. Shen, L.-W. H. Lehman, M. Feng, M. M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data 3, 2016.
[11] Y. Si, J. Du, Z. Li, X. Jiang, T. Miller, F. Wang, W. Jim Zheng, K. Roberts. Deep representation learning of patient data from electronic health records (EHR): A systematic review. Journal of Biomedical Informatics 115, 2021.
[12] J. Zhang, K. Kowsari, J. H. Harrison, J. M. Lobo, L. E. Barnes. Patient2vec: A personalized interpretable deep representation of the longitudinal electronic health record. IEEE Access 6, 2018.
[13] S. Madhumita, S. Simon, L. Kim, D. Walter. Patient representation learning and interpretable evaluation using clinical notes. Journal of Biomedical Informatics 84, 2018.
[14] D. Zhang, C. Yin, J. Zeng, X. Yuan, P. Zhang. Combining structured and unstructured data for predictive models: a deep learning approach. BMC Medical Informatics and Decision Making 20, 280, 2020.
[15] P. Waruhari, A. Babic, L. Nderu, M. C. Were. A review of current patient matching techniques. Informatics Empowers Healthcare Transformation (ICIMTH) 238, 2017.
[16] F. N. Wirth, T. Meurers, M. Johns, F. Prasser. Privacy-preserving data sharing infrastructures for medical research: systematization and comparison. BMC Medical Informatics and Decision Making 21, 242, 2021.
[17] B. H. Just, D. T. Marc, M. Munns, R. H. Sandefer. Why patient matching is a challenge: Research on master patient index (MPI) data discrepancies in key identifying fields. Perspectives in Health Information Management 13, 1, 2016.
[18] Q. V. Le, T. Mikolov. Distributed representations of sentences and documents. Proceedings of the 31st International Conference on Machine Learning (ICML), vol. 32, 2014.
[19] L. Miclet, S. Bayoudh, A. Delhay. Analogical dissimilarity: Definition, algorithms and two experiments in machine learning. Journal of Artificial Intelligence Research 32, 2008.
[20] Y. Lepage. De l'analogie rendant compte de la commutation en linguistique. Habilitation à diriger des recherches, Université Joseph-Fourier - Grenoble I, 2003.
[21] C. Antic. Analogical proportions. ArXiv abs/2006.02854, 2020.
[22] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, H. E. Stanley. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101, 2000.
[23] R. Řehůřek, P. Sojka. Software Framework for Topic Modelling with Large Corpora. Proceedings of the LREC Workshop on New Challenges for NLP Frameworks, 2010.
[24] S. Alsaidi, M. Couceiro, A. Burgun, N. Garcelon, A. Coulet. Exploring analogical inference in healthcare. Workshop on Interactions between Analogical Reasoning and Machine Learning (IARML), 2022, to appear.