=Paper= {{Paper |id=Vol-3038/paper15 |storemode=property |title=The Application of Sequential Pattern Mining Techniques on MIMIC-IV |pdfUrl=https://ceur-ws.org/Vol-3038/paper9.pdf |volume=Vol-3038 |authors=Cecilia Mariciuc,Madalina Raschip |dblpUrl=https://dblp.org/rec/conf/iddm/MariciucR21 }} ==The Application of Sequential Pattern Mining Techniques on MIMIC-IV== https://ceur-ws.org/Vol-3038/paper9.pdf
The Application of Sequential Pattern Mining Techniques on
MIMIC-IV
Cecilia Mariciuca and Mădălina Răschipa
a
    Alexandru Ioan Cuza University of Iaşi, Bulevardul Carol I, Nr.11, Iaşi, 700506, Romania


                Abstract
                The paper studies the application of sequential pattern mining techniques to medical data
                from MIMIC-IV, a large healthcare dataset. Sequences of prescribed drugs to a large
                number of patients are analyzed in order to find out if there are patterns or temporal
                relationships which are general or specific to a particular disease. The PrefixSpan and
                Spade algorithms were applied to mine sequential patterns on all sequences or on a subset
                of them. The extracted patterns could be used to suggest the next prescribed drug. The
                experimental results show that the predictions obtained have a good accuracy for some
                diagnoses.

                Keywords 1
                sequential pattern mining, next prescribed drug, MIMIC- IV

1. Introduction

   The correct use of a drug is dependent upon several conditions. Each drug has some characteristics,
such as indications, possible risk factors and contraindications, like the use with other drugs or the
existence of certain medical conditions. The improper use of drugs and self-medication can be
dangerous [1].
   The advancement of technology has made it possible to digitally collect and store patient data for
their subsequent use. The manipulation of this large amount of data could bring new knowledge to
the medical field [2]. Medications prescribed by specialists can be used to identify the optimal treatment.
The order of the prescriptions could provide important information. Frequent subsequences or predictions
of the next drug can help a doctor in making a quick decision when there are too many medication
options. They can be used to make automatic recommendations in routine cases, or to verify the correctness
of unusual orders.
   Sequential pattern mining can be a solution to this problem because it can identify patterns of
ordered events [3]. A survey of the approaches proposed for sequential pattern mining is given in
[4], [13]. Sequential pattern mining was applied in different areas of research, also including the
medical domain. For example, to identify temporal relationships between drug prescription and
medical events or between prescriptions of different drugs [5], or to identify if a person is susceptible
to a future illness [6].
   In this paper, we used sequential pattern mining to predict the next medication for a patient. Other
existing studies in the literature are based on machine-learning methods. In [8], the prescription data

IDDM-2021: 4th International Conference on Informatics & Data-Driven Medicine, November 19–21, 2021 Valencia, Spain
EMAIL: cecilia.mariciuc27@gmail.com (A. 1); madalina.raschip@uaic.ro (A. 2);
ORCID: 0000-0003-0020-636X (A. 2);
             ©️ 2020 Copyright for this paper by its authors.
             Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
             CEUR Workshop Proceedings (CEUR-WS.org)
is transformed into a stochastic time series for prediction. Various machine-learning approaches were
used and analyzed in order to predict prescription patterns. A different approach is presented in [9].
The authors used neural networks and word2vec representations to predict the medication order
prescribed during hospitalization, which could be used to assist pharmacists. Good results were
obtained for obstetrics and gynecology patients and newborn babies. The paper [10] predicts
prescriptions for the next period of time based on the disease status, laboratory results and the previous
treatment of the patient through a framework of machine learning. The authors used three Long Short-
Term Memory models. The experiments were performed on data from the MIMIC-III ICU and other
data from hospitals in China. The results obtained reveal the effectiveness of the methods. Another
study [11] uses probabilistic topic modelling to predict clinical order patterns.
    A similar study to ours is presented in [7]. The authors describe an approach based on sequential
pattern mining to identify the next prescribed medication for patients with diabetes. The CSPADE
algorithm is used to mine sequential patterns at the drug class and generic drug level. The dataset
used in our research is different from the one considered in [7]. We used a larger real-world
dataset, MIMIC-IV, on which sequential pattern mining has not been applied before. The
preprocessing step of identification of drugs and the construction of sequences are specific to this
dataset. Two mining algorithms, PrefixSpan and SPADE, were considered. Although the
predictions are made in a similar way by constructing some rules from the frequent patterns, the
analysis of the mining algorithms on the MIMIC-IV dataset and the evaluation of the results on
several diagnoses such as "heart attack" are two other elements that distinguish the current paper
from the existing works.
    The paper is organized as follows. A formal description of the problem of mining sequential
patterns and the algorithms used to solve the problem is given in Section 2. In Section 3 we present
the dataset used and in Section 4 the experimental settings and results. We conclude with a summary
and future improvements in Section 5.

2. Sequential Pattern Mining

   The problem that sequential pattern mining is trying to solve can be described as follows: knowing
that many events occur in time, can we learn more about this data if we analyse any ordered sequence
encountered? [13]
   In the following we formally describe the problem. Let 𝐼 = {𝑖1 , 𝑖2 , … , 𝑖𝑛 } be a set of elements, also
called an alphabet. An event (𝑖𝑥1 , 𝑖𝑥2 , … , 𝑖𝑥𝑘 ), 1 ≤ 𝑥𝑗 ≤ 𝑛, ∀ 𝑗 ∈ {1, … , 𝑘} is a nonempty subset of 𝐼 and
an unordered collection of elements. A sequence 〈𝑒1 , 𝑒2 , … , 𝑒𝑞 〉 is an ordered collection of events. A
sequence that contains k elements is known as a k-sequence. A sequence 𝑠𝑒 = 〈𝑒1 , 𝑒2 , … , 𝑒𝑛 〉 is a
subsequence of the sequence 𝑠𝑓 = 〈𝑓1 , 𝑓2 , … , 𝑓𝑚 〉 if there exist integers 1 ≤ 𝑖1 < 𝑖2 < ⋯ < 𝑖𝑛 ≤ 𝑚 such
that 𝑒1 ⊆ 𝑓𝑖1 , 𝑒2 ⊆ 𝑓𝑖2 , … , 𝑒𝑛 ⊆ 𝑓𝑖𝑛 . A sequence database is a set of sequences that have associated
identifiers. The support of a sequence s, denoted sup(s), in a sequence database represents the number
of sequences containing s, i.e., for which s is a subsequence. Giving a value for the minimum support,
denoted 𝑚𝑖𝑛𝑠𝑢𝑝, a sequence is considered frequent in a database if its support is at least equal to the
𝑚𝑖𝑛𝑠𝑢𝑝. Sequential pattern mining aims to find these frequent sequences.

2.1. SPADE

   SPADE (Sequential PAttern Discovery using Equivalence classes) [14] is an Apriori-based
algorithm, making use of the Apriori property that claims that any subsequence of a frequent
sequence is also a frequent sequence. SPADE works with data organized in vertical format, by
transforming the initial sequence database into a table composed of all events where a row is an event
linked with the corresponding sequence identifier (SID) and its position in the sequence (EID).
   At each step k, the algorithm searches for k-sequences that have the chance to be frequent, by
generating id-lists. The first step is to find the 1-frequent sequences. Support is calculated for each
element of the alphabet, counting the entries in the vertical formatted table that contains it. Those
entries will be included in its id-list. Subsequently, only items that reach the minimum support are
frequent and will be considered for finding 2-frequent sequences. In the general case, candidate k-
sequences are found by joining the id-lists of any two frequent (k-1)-sequences, that have the same
SID and have ordered sequential positions (EIDs). The algorithm stops when no more frequent
sequences have been found or no more candidate sequences have been constructed.

2.2. PrefixSpan
   PrefixSpan (Prefix-Projected Sequential Patterns Mining) [15] is a Pattern-Growth-based algorithm,
because it does not generate candidate sequences, but instead uses partitioning of the data set into
projections, which will be explored separately to extend the already known frequent sequences.
   The PrefixSpan algorithm includes the following steps:
   1. Find 1-frequent sequences in the dataset that will later be concatenated to the current frequent
   sequence (or the current frequent prefix) to form new frequent sequences. Initially, the current frequent
   sequence is an empty sequence, 𝑠 = 〈_〉 .
   2. The search space is partitioned according to the sequences found in the previous step. For each
   new, frequent sequence obtained, a projection is created, considering that sequence as a prefix.
   3. For each projection, look for the elements with support at least equal to 𝑚𝑖𝑛𝑠𝑢𝑝 which will be used
   to extend the previous frequent sequences.
   These steps are repeated recursively, the algorithm operating on a divide et impera strategy.

3. MIMIC-IV dataset

   MIMIC (Medical Information Mart for Intensive Care) is a relational database, publicly accessible
which documents the hospitalizations of patients at Beth Israel Deaconess Medical Center (BIDMC)
in Boston, MA, USA. MIMIC-IV [12] is the latest version of the MIMIC database and represents an
improvement of MIMIC-III, with a modular structure and more recent patient data from 2008 to 2019.
MIMIC-IV contains five modules that reflect the origin of the data: core, hosp, icu, ed and cxr.
We used the hosp module which provides information from the electronic medical records that
include laboratory tests, medications, and diagnoses. From this module, the following tables were
used: prescriptions, diagnoses_icd and d_icd_diagnoses. The prescriptions table contains information
about the prescribed medications. The drug type field has three possible values: MAIN, BASE, or
ADDITIVE. The diagnoses_icd table records the diagnoses for which a patient was billed. Each
diagnosis has associated a seq_num which represents the importance of the diagnosis. The lower the
seq_num is, the more significant the diagnosis is. The official name of a diagnosis can be identified
using the table d_icd_ diagnoses.
                   Figure 1: Distribution of the number of drugs per hospitalization

   The prescriptions table contains 17008053 records, i.e., drugs that were individually prescribed. In
most prescriptions, the drug type was in the MAIN category. Prescriptions were made for 232064
patients, with 452115 hospitalizations. A distribution of the number of drugs per hospitalization is
available in Figure 1. In most cases, this number falls in the range [0,400], although there are also
much higher values (a maximum of 2156).
   There are 5280351 diagnoses in the associated table diagnoses_icd, established for 255106 patients
who had 521111 hospitalizations. A patient may have several hospitalizations, and for each
hospitalization, several diagnoses. The distribution of the number of diagnoses per hospitalization is
given in Figure 2.




                 Figure 2: Distribution of the number of diagnoses per hospitalization

   The d_icd_diagnoses table contains 109775 lines, or possible diagnoses. Table 1 shows the ranking
of the most common diagnoses.
Table 1
Distribution of the number of diagnoses per hospitalization
   No.                                     Diagnosis                                No. of occurrences
    1                       Unspecified essential hypertension                           104080
    2                      Other and unspecified hyperlipidemia                           68215
    3                         Essential (primary) hypertension                            54696
    4                           Hyperlipidemia, unspecified                               51097
    5                                 Esophageal reflux                                   49593
    6           Diabetes mellitus without mention of complication, type II or             43705
                        unspecified type, not stated as uncontrolled
    7                    Personal history of nicotine dependence                          40803
    8                                  Atrial fibrillation                                37337
    9                  Depressive disorder, not elsewhere classified                      36905
   10                      Congestive heart failure, unspecified                          36891
   11                Coronary atherosclerosis of native coronary artery                   36404
   12              Gastro-esophageal reflux disease without esophagitis                   35610
   13          Need for prophylactic vaccination and inoculation against viral            32686
                                           hepatitis
   14                          Personal history of tobacco use                            32225
   15              Major depressive disorder, single episode, unspecified                 30398
   16                         Acute kidney failure, unspecified                           29276
   17                      Unspecified acquired hypothyroidism                            29051
   18                           Encounter for immunization                                27146
   19         Atherosclerotic heart disease of native coronary artery without             26706
                                       angina pectoris
   20                               Tobacco use disorder                                  26340



4. Experimental results
   This section describes the steps followed in generating predictions using Sequential Pattern Mining
algorithms on the MIMIC-IV dataset, as shown in Figure 3. The steps are the following: finding the list of
distinct drugs, filtering hospitalizations by diagnoses, building sequences of drugs, running sequential
pattern algorithms and building rules.




                                 Figure 3: Pipeline for the strategy used
      The cases where the predictions are relevant and the parameters that influence their accuracy are
analyzed.


4.1. Preprocessing

    The same drug may appear in prescriptions in several forms, such as various abbreviations (‘hepa’,
‘hepar’, ‘hepari’, ‘heparin’), some of the letters are capitalized (‘acetaZOLAMIDE’, ‘Acetazolamide’,
‘AcetaZOLamide’), more or less spaces and special characters (’Dextromethorphan-’,
’Dextromethorphan’), additional words, such as ’pain’, ’bulk’, ’extended release’ (’vancomycin’,
’vancomycin (bulk)’). Another, more complex problem, is that medicines may appear under completely
different names, i.e. with the generic name, or with the name used by the brand. A solution to all these
inconsistencies is the usage of the gsn field, which contains one or more 6 digit Generic Sequence Number
(GSN) codes. GSN identifies a product based on its formula, dose, method of administration and
concentration and can be used to group generally equivalent products, which may differ only through the
manufacturer. In order to reduce the existence of several equivalent elements, we created a list of drugs
with a unique id associated with the help of the GSN codes. Since a drug or other equivalent drugs can be
associated with several GSN codes, groups of GSN codes will be established so that one group contains all
codes that have been mentioned together directly or indirectly. Two drugs will be considered equivalent if
at least one of their GSN codes (not necessarily identical) is found in the same group of GSN codes. Thus,
starting from a list of 16970 pairs (drug, gsn), we obtained a list of 3398 drugs with a unique id after
preprocessing.

4.2. The construction of sequences

    A sequence is an ordered list of events of the form 〈𝑒1 , 𝑒2 , … , 𝑒𝑛 〉 and initially the events are empty
subsets of the alphabet I. In our case, the alphabet is the set of drugs ids 𝐼 = {0,1,2, … ,3397}. A sequence
corresponds to a hospitalization and is represented by the list of ids of the drugs prescribed, grouped and
sorted by time. For example, the sequence 〈(2624), (2624), (2769, 539, 1100)〉 specifies that in the case
of a hospitalization, the drug with id 2624 was prescribed first, then again, the same drug, and then followed
by a group of three drugs.
    We considered two cases for the generation of sequences: the sequences are built for all
hospitalizations, or only on a subset of hospitalizations. In the first case, the list of distinct
hospitalizations that have at least one prescription can be easily found by querying the prescriptions
table. For each element of the list, the events of the corresponding sequence are considered. Given that
there are 452 115 distinct hospitalizations, the number of generated sequences is high, fact which limits
the competence of mining algorithms. Consequently, for the second case, we considered filtering the
hospitalizations after one or more diagnoses. Given a set of keywords, we will search for
hospitalizations that have diagnoses that contain all the keywords. For example, for the words ‘heart’
and ‘pneumonia’, hospitalization with the following diagnoses ‘Pneumonia due to adenovirus’,
‘Aneurysm of heart’, ‘Other and unspecified hyperlipidemia’ will be selected. In addition to this
filtering, when constructing sequences, only prescriptions with a drug type equal to MAIN will be
considered.
    This filtering is meant to facilitate the use of fewer resources (time and memory) by algorithms and to
obtain better results, because the selection of hospitalizations by diagnoses can increase the chance of
finding more common patterns.
4.3. Sequence pattern mining

   The frequent sequences of prescribed drugs were extracted using two sequential pattern mining
algorithms, SPADE and PrefixSpan, available in the open-source Java library SPMF [16]. We run the
algorithms on an instance based on Windows 10 Pro that has an Intel(R) Core (TM) i7-8550U CPU
@ 1.80GHz processor with 8 GiB of memory.

   SPADE cannot be applied to the entire dataset due to the additional memory the algorithm
requires to transform the sequence database into a vertical format. SPADE is suitable to be used for a
subset of hospitalizations. The results of SPADE are given in Table 2.

Table 2
The results of SPADE
   Diagnosis      diagnoses    𝑚𝑖𝑛𝑠𝑢𝑝        sequences         Avg no.    frequent    Time     Memory
                                          (hospitalizations)      of     sequences     (s)      (mb)
                                                               events
  Heart failure      73         0.025           52086           21.04     80983       86.94    1556.81
      Born in        21         0.008           37113            2.77     44020       41.21    407.844
     hospital
  Acute kidney       15         0.025           48255          24.92      68860      162.21    906.58
      failure
    Need for          1         0.0001          29177           1.44      96546       44.98    329.158
  prophylactic
   vaccination
       and
   inoculation
  against viral
     hepatitis
  Circumcision        2         0.0001          13269           2.05      262264      45.61    329.158
 Encounter for        1         0.0011          20149           1.88       73801      51.04     415.74
 immunization

   We considered six use cases, i.e., hospitalizations that had the following diagnoses: Heart failure,
Born in hospital, Acute kidney failure, Need for prophylactic vaccination and inoculation against viral
hepatitis, circumcision and Encounter for immunization. We selected the hospitalizations for a
diagnosis based on some terms. The chosen terms are contained in or represent the names for the most
common diagnoses. We consider diagnoses with 𝑠𝑒𝑞_𝑛𝑢𝑚 ≥ 5 because of the higher chance that they
will be the main reason for the hospitalization. For example, there are 73 different diagnoses containing
the term 'heart failure' and for which this diagnosis is important (𝑠𝑒𝑞_𝑛𝑢𝑚 ≥ 5). One of the most
common diagnoses is Congestive heart failure, unspecified, according to Error! Reference source not
found.. The number of resulted sequences is 52086, with an average of 21,04 events. A number of 80983
frequent sequences were found by applying the algorithm on the sequences. The selected values for
the minimum support are specified in Table 2. The value of 𝑚𝑖𝑛𝑠𝑢𝑝 is empirically chosen for practical
time limits. A lower number of events means that fewer medications are prescribed for those
hospitalizations, which allows the choice of a lower minimum support.
   PrefixSpan The parameters of the algorithm are the value of the minimum support and, optionally,
the maximum length of the sequences.
   We tested the algorithm for all hospitalizations, using a minimum support of 0.025 (10891
sequences) and a maximum length of a sequence of 20. The number of frequently found sequences is
equal to 9771. Some of the most frequently used drugs are: lactated ringers (Ringer’s Lactate
Solution), hydralazine (Hydralazine), tylenol (Paracetamol), 0.9% sodium chloride, potassium
chloride, heparin flush, etc.
   A small decrease in the minimum support can significantly increase the number of sequences and
thus the execution time. For example, for a minimum support of 0.02 (8713 sequences), the number
of frequent sequences increases to 20968.
   Next, we repeated the tests made with SPADE, but using the PrefixSpan algorithm instead.
The results are given in Table 3.

Table 3
The results of PrefixSpan
        Diagnosis            𝑚𝑖𝑛𝑠𝑢𝑝          Frequent sequences          Time (s)     Memory (mb)
      Heart failure           0.025               101517                   319          662.53
    Born in hospital          0.008                44601                  29.59         241.61
 Acute kidney failure         0.025                86110                 726.23         555.51
        Need for             0.0001                99975                   0.77         100.57
      prophylactic
    vaccination and
  inoculation against
     viral hepatitis
      Circumcision            0.0001                268516                 1.68         95.39
     Encounter for            0.0011                 93614                 6.4          114.12
     immunization

   PrefixSpan finds more frequent sequences and uses less memory than SPADE. However, the
execution time is significantly shorter for SPADE when we have long sequences, and shorter for
PrefixSpan in case of short sequences.
   We next analyzed the frequent sequences that resulted from the application of the algorithms. We
considered two cases: with a high minimum support (for example, hospitalizations with heart failure
diagnosis) and with a low minimum support (for example, hospitalizations with Need for prophylactic
vaccination and inoculation against viral hepatitis diagnosis).
   Some frequent sequences found for hospitalizations with Need for prophylactic vaccination and
inoculation against viral hepatitis diagnosis are given in Table 4.

Table 4
Sequential patterns
                                  Sequential patterns                                   Support
                               ‘erythromycin ophthalmic’                                 27346
  ’erythromycin ophthalmic’ ’phytonadione (vitamin k1)’ → ’erythromycin ophthalmic’        11
                   ’hepatitis b vaccine’ ’phytonadione (vitamin k1)’

          ’triple dye’ ’erythromycin ophthalmic’ ’hepatitis b immune globulin’             37
                                  ’phytonadione (vitamin k1)’                                                                       27347
                                      ’hepatitis b vaccine’                                                                          7876
                         ’phytonadione (vitamin k1)’ → ’gentamicin’                                                                  1324
                       ’phytonadione (vitamin k1)’ → ’acetaminophen’                                                                 2356
                    ’lidocaine’ ’acetaminophen’ → ’hepatitis b vaccine’                                                               189
                              ’triple dye’ → ’hepatitis b vaccine’                                                                   2525
          ’triple dye’ ’erythromycin ophthalmic’ ’hepatitis b immune globulin’                                                         37
  ’erythromycin ophthalmic’ ’phytonadione (vitamin k1)’→’phytonadione (vitamink1)’                                                   7428

   We analyzed the sequential patterns in order to identify the most commonly used medications. For
heart failure diagnoses, the most common drugs among the frequent sequences are: tylenol, senna
laxative, aspirin, docusate sodium, dextrose, furosemide, metoprolol tartrate, glucagon, etc. Most of
these drugs are also common in all hospitalizations, making difficult to say whether they are
specific to these types of hospitalizations or not. Consequently, we manually searched for drugs
known to be common for the treatment of heart failure2. We give next some of the results:
   •    from the class Angiotensin-Converting Enzyme (ACE) Inhibitors: lisinopril is often found, being
   contained in over 400 sequences, captopropyl is found in 6 sequences, with support in the range [1000-
   2200]
   •    from the Beta Blockers class: carvedilol appears in over 100 sequences; metoprolol is one of the
   most frequently found drug
   •    from the class of Vasodilators: hydralazine is found in many different forms, nitroglycerin
   is found in over 500 sequences
   •    from the class of Diuretics: furosemide is one of the most common drugs, torsemide is found in
   over 500 sequences, metallozone is found only individually
   For patients diagnosed with Need for prophylactic vaccination and inoculation against viral
hepatitis, the drugs prescribed are less varied, most of them being hepatitis b immune globulin (bayhep
b), hepatitis b vaccine, vitamin k, gentamicin, erythromycin ophthalmic, tylenol, heparin, triple dye.
In addition to the vaccine itself (current diagnosis indicates the need for hepatitis vaccination), usual
drugs are found, or drugs specific to newborns, because the hepatitis B vaccine is administered to
them immediately after birth.

4.4. Make predictions using frequent sequences
   As we previously specified, the frequent sequences can be utilised to identify drugs used for
different diagnoses. But when the number of sequences is huge, this approach becomes less relevant
and time expensive. Frequent drug sequences can reveal which drugs or combinations of drugs are
more likely to be recommended when we know the previous prescriptions. We will predict the most
likely drugs to be prescribed and compare the result with the real values to determine the accuracy
of the predictions.
   Rules construction To describe the links between the drugs from frequent sequences, we will
generate rules of form (antecedent, consequent, support) with the following meanings:



   2
       https://www.nhs.uk/conditions/heart-failure/treatment/,

       https://www.heart.org/en/health-topics/heart-failure/treatment-options-for-heart-failure/medications-used-to-treat-heart-failure
  •      For a sequence 𝑠 = 〈𝑒1 , 𝑒2 , … , 𝑒𝑛 〉, the antecedent will contain the first (n−1) events
  〈𝑒1 , 𝑒2 , … , 𝑒𝑛−1 〉, and the consequent will be the last event 𝑒𝑛 . Only sequences containing at least two
  elements are considered.
  •      The support of the rule will correspond to the sequence support.
  Some examples of rules generated from frequent sequences for Need for pro phylactic vaccination
and inoculation against viral hepatitis diagnosis are given in Table 5.

Table 5
Rules generated for the Need for prophylactic vaccination and inoculation against viral hepatitis diagnosis
            Antecedent                                   Consequent                            Support
    {’erythromycin ophthalmic’,       {’erythromycin ophthalmic’ ’hepatitis b vaccine’             11
    ’phytonadione (vitamin k1)’}                 ’phytonadione (vitamin k1)’}

   {’phytonadione (vitamin k1)’}                            {’gentamicin’}                         1324
   {’phytonadione (vitamin k1)’}                         {’acetaminophen’}                         2356
   {’lidocaine’, ’acetaminophen’}                      {’hepatitis b vaccine’}                     189
             {’triple dye’}                            {’hepatitis b vaccine’}                     2525
    {’erythromycin ophthalmic’,                    {’phytonadione (vitamin k1)’}                   7428
    ’phytonadione (vitamin k1)’}


    Predictions using rules Before making any predictions, the list of rules is sorted using a multi-
level approach: first, descending by the number of events from the antecedent and then descending
by support. To narrow the search space, we also created a threshold dictionary as follows: for each
length of the antecedent that exists in the previously sorted list, store the index of the first
corresponding rule. For example, the following threshold dictionary, denoted
𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑𝑠 = {8: 0, 7: 97, 6: 1178, 5: 6174, 4: 17020, 3: 29901, 2: 39094, 1: 43006, 0: 43908} reveals
that there are eight distinct lengths of the rules’ antecedent. The rules that have an antecedent
containing x events, 1 ≤ 𝑥 ≤ 8, will be found in the list starting with position 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑𝑠[𝑥] and up
to position 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑𝑠[𝑥 − 1] − 1.
    Having a patient’s prescribed medication sequence during a hospitalization 𝑠 = 〈𝑒1 , 𝑒2 , … , 𝑒𝑛 〉
and a sorted list of rules, the predictions will be made as follows:
    1. If 𝑛 ≥ 1, iterate through the rules with the number of events from the antecedent equal to the number
         of events in s. The threshold list will be used.
    2. For each rule, check if there is a match between the antecedent and the sequence s. If a match is
    found, the event from the consequent is added to a list.
    3. If five matches are found, the search ends. Otherwise, the first event from the sequence s is removed
    and the previous steps are repeated. Deleting the first item from s means, in fact, that we are trying to
    test on the patient’s more recent history.
    At the end, the list of maximum five events represents the predictions of drugs for the patient with
the sequence s as history.
    To test the accuracy of the predictions, we used the hospitalizations for which the frequent sequences
were found and which, implicitly, were used to generate the rules and the dictionary of thresholds.
Denote by 𝑝𝑚𝑎𝑥 the maximum value of a key in the threshold dictionary, or the maximum length of
an antecedent for the current rules. The sequences of each hospitalization are divided into segments
of length 𝑝𝑚𝑎𝑥 . If they are not divided exactly, the last segment will be considered if its length is
at least two. The last event is removed from each segment, as it will be used to verify the correctness
of the predictions. Predictions are made based on these segments, and if at least one of the drugs
contained in the predicted events is found in the event set aside, then we will consider the prediction
is correct. Accuracy is computed as the percentage of correct predictions out of the total predictions
made.

   Predictions results The prediction results are given in Table 6.

Table 6
Predictions results
   Diagnosis        Algorithm        Support       𝑝𝑚𝑎𝑥      Sequences Segments         Accuracy    Runtime
                                                                                                      (sec)
  Heart failure       SPADE           0.025          12          25%         22139       25.83%     2914.35
      Born in         SPADE           0.008           8         100%         18165       65.15%      858.18
     hospital
    Need for        PrefixSpan        0.0001         13         100%         9962        80.14%      106.96
  prophylactic
   vaccination
       and
   inoculation
  against viral
     hepatitis
  Circumcision      PrefixSpan        0.0001         13         100%         10854       89.32%      37.07
 Encounter for      PrefixSpan        0.0011          7          50%         3965        55.88%      266.22
 immunization

    For the heart failure diagnosis, for example, for 5718 sequences at least one correct prediction was
obtained, meaning an accuracy of 25.83%, and for 4704 we could not find any prediction. If we take into
account only the sequences on which predictions were found, then the accuracy would be 32.79%.
    For certain diagnostics, like Need for prophylactic vaccination and inoculation against viral hepatitis
and Circumcision the accuracy is high, while for other diagnoses like Heart failure it is small. Statistically,
the number of prescriptions increases with age [17]. Intuitively, a diagnosis that contains the term ‘born in
hospital’ refers to newborns, in which case certain standard medicines are required. The number of allowed
drugs is lower (many drugs have age restrictions). In this case, it is easier to identify which drugs are
more likely to be prescribed. The diagnoses Need for prophylactic vaccination and inoculation against
viral hepatitis and Encounter for immunization indicate that a person needs administration of a
vaccine. The person is not necessarily ill, so the number of drugs is not expected to be high. Instead,
diagnoses that contain ’heart failure’ indicate a serious, complex condition that is often found in the
population over the age of 65.
    To better clarify the possible reasons that affect the accuracy of predictions, we analysed other
measures detailed in Table 7. The second column contains the total number of different drugs
encountered in the sequences. The next column contains the average number of drugs per sequence.
The last column contains the average difference between the date of the last prescription and the date
of the first prescription.
Table 7
Measures that can influence the accuracy of predictions
          Diagnosis               Total drugs           Drugs per sequence         Time diff (days)
        Heart failure                2155                      45.82                     6
      Born in hospital                305                       4.92                     3
   Need for prophylactic              200                       2.96                     1
      vaccination and
  inoculation against viral
          hepatitis
        Circumcision                  112                       3.76                      1
       Encounter for                  864                       3.84                      1
       immunization

    According to Table 7, when there is a wider range of drugs to choose from, the accuracy tends to
decrease. The average number of drugs per sequence influences the sequential pattern mining algorithms:
it is necessary to usually choose a larger support, so as not to use too much memory, fact which also
influences the accuracy. Another parameter that could influence the results is the length of the period
in which prescriptions were made. This may indicate complex diagnoses or, conversely, less severe
cases.
    The choice of the minimum support can influence the accuracy of the predictions, and indirectly
the runtime and the memory. Table 8 exemplifies the way the support influences the accuracy. The
last column is the time needed to compute the predictions.

Table 8
The influence of support on the accuracy of predictions
       Diagnosis              Support         Frequent sequences       Accuracy          Time (sec)
     Circumcision               0.01                   246              85.54%              0.39
                               0.001                  2505              88.50%              1.38
                              0.0001                 268516             89.32%             37.07

       Need for                0.01                  132                71.36%                0.12
     prophylactic             0.001                 1413                77.13%                0.45
   vaccination and            0.0001                99975               80.14%                6.97
 inoculation against
    viral hepatitis

  As the minimum support decreases, the accuracy increases slightly, and the runtime also increases.
Lowering the support is useful up to a certain limit, for which a reasonable execution time is obtained

5. Conclusions

   Sequential Pattern Mining represents an effective technique to make predictions of medications
based on the patient’s past prescription history. This paper studies in particular the application of
two algorithms, SPADE and PrefixSpan, as a means to find frequent sequences that reveal temporal
relationships between medications. The resulting frequent sequences are general or specific to one or
more diseases and are used to construct rules. Predictions are made by finding matches of a patient’s
medication history in the list of rules. According to the experimental results, there are situations in
which the predictions made can reach a satisfactory accuracy. Such a solution is especially useful for
routine cases, for instance, immunizations, or for the treatment of newborns. Instead, for more complex
diagnoses, additional study is needed to optimize the results.
   Some improvements that can be made are the addition and the usage of supplementary patient
information, such as laboratory results, age and supplementary medication details, like the dose, the
method of administration.

6. References

[1] Schmiedl, S., Rottenkolber, M., Hasford, J., Rottenkolber, D., Farker, K., Drewelow, B., and
     Thürmann, P. Self-medication with over-the-counter and prescribed drugs causing adverse-drug-
     reaction-related hospital admissions: results of a prospective, long-term multi-centre study. Drug
     safety (2014): 37(4):225-235. doi: 10.1007/s40264-014-0141-3.
[2] Prather, J. C., Lobach, D. F., Goodwin, L. K., Hales, J. W., Hage, M. L., and Hammond, W. E. Medical
     data mining: knowledge discovery in a clinical data warehouse Proceedings: a conference of the
     American Medical Informatics Association. AMIA Fall Symposium (1997): 101-105.
[3] Agrawal, R., and Srikant, R. Mining sequential patterns. Proceedings of the Eleventh International
     Conference on Data Engineering (1995): 3-14
[4] Fournier-Viger, P., Lin, J. C. W., Kiran, R. U., Koh, Y. S., and Thomas, R. A survey of sequential
     pattern mining. Data Science and Pattern Recognition, 1(1), 54-77(2017)
[5] Norén, G. N., Bate, A., Hopstadius, J., Star, K., and Edwards, I. R. Temporal pattern discovery for
     trends and transient effects: its application to patient records. In Proceedings of the 14th ACM
     SIGKDD international conference on Knowledge discovery and data mining (2008): 963-971. doi:
     10.1145/1401890.1402005.
[6] Reps, J., Garibaldi, J. M., Aickelin, U., Soria, D., Gibson, J. E., and Hubbard, R. B. Discovering
     sequential patterns in a UK general practice database. In Proceedings of 2012 IEEE-EMBS
     International Conference on Biomedical and Health Informatics (2012): 960-963. doi:
     10.1109/BHI.2012.6211748.
[7] Wright, A., Wright, A., McCoy, A., Sittig, D.: The use of sequential pattern mining to predict next
     prescribed medications. Journal of biomedical informatics 53 (2015): 73-80. doi:
     10.1016/j.jbi.2014.09.003.
[8] Helgason, ́I.S. Predicting prescription patterns (Doctoral dissertation, Massachusetts Institute of
     Technology) (2008)
[9] Thibault, M., Lebel, D.: An application of machine learning to assist medication order review by
     pharmacists in a health care center. (2019). https://doi.org/10.1101/19013029.
[10] Jin, B., Yang, H., Sun, L., Liu, C., Qu, Y., Tong, J. A treatment engine by predicting next-period
     prescriptions. In Proceedings of the 24th ACM SIGKDD Inter-national Conference on Knowledge
     Discovery & Data Mining, (2018): 1608-1616. doi: 10.1145/3219819.3220095.
[11] Chen, J. H., Goldstein, M. K., Asch, S. M., Mackey, L., & Altman, R. B. Predicting inpatient clinical
     order patterns with probabilistic topic models vs conventional order sets. Journal of the American
     Medical Informatics Association, 24(3), 472-480(2017). doi: 10.1093/jamia/ocw136
[12] Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., Mark, R. MIMIC-IV (version 1.0).
     PhysioNet (2021)
[13] Mooney, C.H., Roddick, J.F. Sequential pattern mining–approaches and algorithms. ACM Computing
     Surveys (CSUR), 45(2), 1-39 (2013). doi: 10.1145/2431211.2431218.
[14] Zaki, M. J. SPADE: An efficient algorithm for mining frequent sequences. Machine learning, 42(1),
     31-60 (2001). doi: 10.1023/A:1007652502315.
[15] Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H., Chen, Q., & Hsu, M.C. Mining sequential
     patterns by pattern-growth: The prefixspan approach. IEEE Transactions on knowledge and data
     engineering, 16(11), 1424-1440 (2004). doi: 10.1109/TKDE.2004.77
[16] Fournier-Viger, P., Lin, J. C. W., Gomariz, A., Gueniche, T., Soltani, A., Deng,Z., & Lam, H. T. The
     SPMF open-source data mining library version 2. In Joint European conference on machine learning
     and knowledge discovery in databases (pp. 36-40) (2016).
[17] Martin, C. B., Hales, C. M., Gu, Q., & Ogden, C. L. Prescription drug use in the United States, 2015–
     2016, (2019): 1-8.