<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>IDDM-</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>The Application of Sequential Pattern Mining Techniques on MIMIC-IV</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cecilia Mariciuc</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mădălina Răschip</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Alexandru Ioan Cuza University of Iaşi</institution>
          ,
          <addr-line>Bulevardul Carol I, Nr.11, Iaşi, 700506</addr-line>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>4</volume>
      <fpage>19</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>The paper studies the application of sequential pattern mining techniques to medical data from MIMIC-IV, a large healthcare dataset. Sequences of drugs prescribed to a large number of patients are analyzed to determine whether there are patterns or temporal relationships that are general or specific to a particular disease. The PrefixSpan and SPADE algorithms were applied to mine sequential patterns on all sequences or on a subset of them. The extracted patterns can be used to suggest the next prescribed drug. The experimental results show that the predictions reach good accuracy for some diagnoses.</p>
      </abstract>
      <kwd-group>
        <kwd>sequential pattern mining</kwd>
        <kwd>next prescribed drug</kwd>
        <kwd>MIMIC-IV</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The correct use of a drug is dependent upon several conditions. Each drug has some characteristics,
such as indications, possible risk factors and contraindications, like the use with other drugs or the
existence of certain medical conditions. The improper use of drugs and self-medication can be
dangerous [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        The advancement of technology has made it possible to digitally collect and store patient data for
their subsequent use. The manipulation of this large amount of data could bring new knowledge to
the medical field [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Medications prescribed by specialists can be used to identify the optimal treatment.
The order of the prescriptions could provide important information. Frequent subsequences or predictions
of the next drug can help a doctor in making a quick decision when there are too many medication
options. They can be used to make automatic recommendations in routine cases, or to verify the correctness
of unusual orders.
      </p>
      <p>
        Sequential pattern mining can be a solution to this problem because it can identify patterns of
ordered events [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. A survey of the approaches proposed for sequential pattern mining is given in
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Sequential pattern mining has been applied in various areas of research, including the
medical domain, for example to identify temporal relationships between drug prescriptions and
medical events or between prescriptions of different drugs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], or to identify if a person is susceptible
to a future illness [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In this paper, we used sequential pattern mining to predict the next medication for a patient. Other
existing studies in the literature are based on machine-learning methods. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the prescription data
is transformed into a stochastic time series for prediction. Various machine-learning approaches were
used and analyzed in order to predict prescription patterns. A different approach is presented in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
The authors used neural networks and word2vec representations to predict the medication order
prescribed during hospitalization, which could be used to assist pharmacists. Good results were
obtained for obstetrics and gynecology patients and newborn babies. The paper [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] predicts
prescriptions for the next period of time based on the disease status, laboratory results and the previous
treatment of the patient through a machine-learning framework. The authors used three Long
Short-Term Memory models. The experiments were performed on data from the MIMIC-III ICU and other
data from hospitals in China. The results obtained reveal the effectiveness of the methods. Another
study [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] uses probabilistic topic modelling to predict clinical order patterns.
      </p>
      <p>
        A similar study to ours is presented in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The authors describe an approach based on sequential
pattern mining to identify the next prescribed medication for patients with diabetes. The CSPADE
algorithm is used to mine sequential patterns at the drug class and generic drug level. The dataset
used in our research is different from the one considered in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We used a larger real-world dataset, MIMIC-IV, on which sequential pattern mining has not
been applied before. The preprocessing step of identifying drugs and the construction of sequences
are specific to this dataset. Two mining algorithms, PrefixSpan and SPADE, were considered. Although
the predictions are made in a similar way, by constructing rules from the frequent patterns, the
analysis of the mining algorithms on the MIMIC-IV dataset and the evaluation of the results on
several diagnoses, such as "heart attack", are two other elements that distinguish the current paper
from the existing works.</p>
      <p>The paper is organized as follows. A formal description of the problem of mining sequential
patterns and of the algorithms used to solve it is given in Section 2. In Section 3 we present
the dataset used, and in Section 4 the experimental settings and results. We conclude with a summary
and future improvements in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Sequential Pattern Mining</title>
      <p>
        The problem that sequential pattern mining tries to solve can be described as follows: knowing
that many events occur over time, can we learn more about the data by analysing the ordered sequences
encountered? [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
      </p>
      <p>In the following we formally describe the problem. Let <inline-formula><tex-math>I = \{i_1, i_2, \ldots, i_m\}</tex-math></inline-formula> be a set of elements, also
called an alphabet. An event <inline-formula><tex-math>(i_{j_1}, i_{j_2}, \ldots, i_{j_r})</tex-math></inline-formula>, with <inline-formula><tex-math>1 \le j_l \le m</tex-math></inline-formula> for all <inline-formula><tex-math>l \in \{1, \ldots, r\}</tex-math></inline-formula>, is a nonempty subset of <inline-formula><tex-math>I</tex-math></inline-formula>, i.e.,
an unordered collection of elements. A sequence <inline-formula><tex-math>\langle e_1, e_2, \ldots, e_n \rangle</tex-math></inline-formula> is an ordered collection of events. A
sequence that contains k elements is known as a k-sequence. A sequence <inline-formula><tex-math>s_a = \langle a_1, a_2, \ldots, a_p \rangle</tex-math></inline-formula> is a
subsequence of the sequence <inline-formula><tex-math>s_b = \langle b_1, b_2, \ldots, b_q \rangle</tex-math></inline-formula> if there exist integers <inline-formula><tex-math>1 \le t_1 &lt; t_2 &lt; \cdots &lt; t_p \le q</tex-math></inline-formula> such
that <inline-formula><tex-math>a_1 \subseteq b_{t_1}, a_2 \subseteq b_{t_2}, \ldots, a_p \subseteq b_{t_p}</tex-math></inline-formula>. A sequence database is a set of sequences that have associated
identifiers. The support of a sequence s, denoted sup(s), in a sequence database represents the number
of sequences containing s, i.e., for which s is a subsequence. Given a value for the minimum support,
denoted minsup, a sequence is considered frequent in a database if its support is at least equal to
minsup. Sequential pattern mining aims to find these frequent sequences.</p>
      <sec id="sec-2-1">
        <title>2.1. SPADE</title>
        <p>
          SPADE (Sequential PAttern Discovery using Equivalence classes) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] is an Apriori-based
algorithm, making use of the Apriori property, which states that any subsequence of a frequent
sequence is also a frequent sequence. SPADE works with data organized in vertical format, by
transforming the initial sequence database into a table composed of all events, where a row is an event
linked with the corresponding sequence identifier (SID) and its position in the sequence (EID).</p>
        <p>At each step k, the algorithm searches for k-sequences that have a chance of being frequent, by
generating id-lists. The first step is to find the frequent 1-sequences. Support is calculated for each
element of the alphabet by counting the entries in the vertically formatted table that contain it. Those
entries are included in its id-list. Subsequently, only items that reach the minimum support are
frequent and are considered for finding frequent 2-sequences. In the general case, candidate
k-sequences are found by joining the id-lists of any two frequent (k-1)-sequences on entries that have the same
SID and ordered sequential positions (EIDs). The algorithm stops when no more frequent
sequences have been found or no more candidate sequences have been constructed.</p>
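        <p>The vertical format and the id-list join can be illustrated with a minimal Python sketch. It is not the SPMF implementation used in the experiments: the function names (vertical_format, support, temporal_join) and the toy database are assumptions introduced only for illustration, and only the join that appends an item occurring in a later event is shown.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def vertical_format(database):
    """Map each item to its id-list: the set of (SID, EID) pairs where it occurs."""
    id_lists = defaultdict(set)
    for sid, sequence in database.items():
        for eid, event in enumerate(sequence, start=1):
            for item in event:
                id_lists[item].add((sid, eid))
    return id_lists


def support(id_list):
    """Support = number of distinct sequences (SIDs) present in the id-list."""
    return len({sid for sid, _ in id_list})


def temporal_join(prefix_ids, item_ids):
    """Id-list of 'prefix, then item in a later event'.

    An (SID, EID) entry of the item is kept when the prefix ends in the
    same sequence at a strictly smaller EID.
    """
    earliest = {}
    for sid, eid in prefix_ids:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return {(sid, eid) for sid, eid in item_ids
            if sid in earliest and earliest[sid] < eid}


# Hypothetical toy database: SID -> ordered list of events (sets of items).
database = {
    1: [{"a"}, {"b"}, {"c"}],
    2: [{"a", "b"}, {"c"}],
    3: [{"b"}, {"a"}],
}
id_lists = vertical_format(database)
a_then_c = temporal_join(id_lists["a"], id_lists["c"])
a_then_b = temporal_join(id_lists["a"], id_lists["b"])
print(support(a_then_c), support(a_then_b))
```
]]></preformat>
        <p>In this sketch the support of a candidate is read directly from its id-list, so the original database does not have to be rescanned after the vertical table is built.</p>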
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2.2. PrefixSpan</title>
      <p>PrefixSpan (Prefix-Projected Sequential Patterns Mining) [15] is a Pattern-Growth-based algorithm,
because it does not generate candidate sequences, but instead uses partitioning of the data set into
projections, which will be explored separately to extend the already known frequent sequences.</p>
      <p>The PrefixSpan algorithm includes the following steps:</p>
      <list list-type="order">
        <list-item><p>Find the frequent 1-sequences in the dataset; these will later be concatenated to the current frequent
sequence (the current frequent prefix) to form new frequent sequences. Initially, the current frequent
sequence is the empty sequence 〈〉.</p></list-item>
        <list-item><p>Partition the search space according to the sequences found in the previous step. For each
new frequent sequence obtained, a projection is created, considering that sequence as a prefix.</p></list-item>
        <list-item><p>For each projection, look for the elements with support at least equal to minsup, which will be used
to extend the previous frequent sequences.</p></list-item>
      </list>
      <p>These steps are repeated recursively, so the algorithm follows a divide-and-conquer strategy.</p>
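      <p>The recursive steps can be sketched in Python as follows. This is a simplified illustration, not the SPMF implementation: it assumes that every event contains a single item, that minsup is an absolute count, and the function name prefixspan and the toy database are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def prefixspan(database, minsup):
    """Return {pattern: support} for sequences whose events are single items."""
    results = {}

    def mine(prefix, projected):
        # Step 1/3: count each item once per projected sequence.
        counts = {}
        for suffix in projected:
            for item in set(suffix):
                counts[item] = counts.get(item, 0) + 1
        for item, count in sorted(counts.items()):
            if count < minsup:
                continue
            pattern = prefix + (item,)
            results[pattern] = count
            # Step 2: project on the new prefix, keeping what follows
            # the first occurrence of the item in each sequence.
            new_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            mine(pattern, new_projected)

    mine((), [list(s) for s in database])
    return results


# Hypothetical toy database of four sequences.
database = [["a", "b", "c"], ["a", "c"], ["b", "a", "c"], ["b", "c"]]
patterns = prefixspan(database, minsup=3)
for pattern, count in sorted(patterns.items()):
    print(pattern, count)
```
]]></preformat>
      <p>No candidate sequences are generated here: each recursive call only inspects the suffixes that remain after the current prefix.</p>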
    </sec>
    <sec id="sec-4">
      <title>3. MIMIC-IV dataset</title>
      <p>
        MIMIC (Medical Information Mart for Intensive Care) is a publicly accessible relational database
that documents the hospitalizations of patients at Beth Israel Deaconess Medical Center (BIDMC)
in Boston, MA, USA. MIMIC-IV [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is the latest version of the MIMIC database and represents an
improvement of MIMIC-III, with a modular structure and more recent patient data from 2008 to 2019.
MIMIC-IV contains five modules that reflect the origin of the data: core, hosp, icu, ed and cxr.
We used the hosp module, which provides information from the electronic medical records,
including laboratory tests, medications, and diagnoses. From this module, the following tables were
used: prescriptions, diagnoses_icd and d_icd_diagnoses. The prescriptions table contains information
about the prescribed medications. The drug type field has three possible values: MAIN, BASE, or
ADDITIVE. The diagnoses_icd table records the diagnoses for which a patient was billed. Each
diagnosis has an associated seq_num, which represents the importance of the diagnosis. The lower the
seq_num, the more significant the diagnosis. The official name of a diagnosis can be identified
using the table d_icd_diagnoses.
      </p>
      <p>The prescriptions table contains 17008053 records, i.e., drugs that were individually prescribed. In
most prescriptions, the drug type was in the MAIN category. Prescriptions were made for 232064
patients, with 452115 hospitalizations. A distribution of the number of drugs per hospitalization is
available in Figure 1. In most cases, this number falls in the range [0,400], although there are also
much higher values (a maximum of 2156).</p>
      <p>There are 5280351 diagnoses in the associated table diagnoses_icd, established for 255106 patients
who had 521111 hospitalizations. A patient may have several hospitalizations, and for each
hospitalization, several diagnoses. The distribution of the number of diagnoses per hospitalization is
given in Figure 2.</p>
      <p>The d_icd_diagnoses table contains 109775 rows, i.e., possible diagnoses. Table 1 shows the ranking
of the most common diagnoses.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Experimental results</title>
      <p>This section describes the steps followed in generating predictions using Sequential Pattern Mining
algorithms on the MIMIC-IV dataset, as shown in Figure 3. The steps are the following: finding the list of
distinct drugs, filtering hospitalizations by diagnoses, building sequences of drugs, running sequential
pattern algorithms and building rules.</p>
      <p>The cases where the predictions are relevant and the parameters that influence their accuracy are
analyzed.</p>
    </sec>
    <sec id="sec-6">
      <title>4.1. Preprocessing</title>
      <p>The same drug may appear in prescriptions in several forms: various abbreviations (‘hepa’,
‘hepar’, ‘hepari’, ‘heparin’), inconsistent capitalization (‘acetaZOLAMIDE’, ‘Acetazolamide’,
‘AcetaZOLamide’), extra spaces and special characters (‘Dextromethorphan-’,
‘Dextromethorphan’), or additional words such as ‘pain’, ‘bulk’ or ‘extended release’ (‘vancomycin’,
‘vancomycin (bulk)’). Another, more complex problem is that medicines may appear under completely
different names, i.e., under the generic name or under the brand name. A solution to all these
inconsistencies is to use the gsn field, which contains one or more six-digit Generic Sequence Number
(GSN) codes. A GSN identifies a product based on its formula, dose, method of administration and
concentration and can be used to group generally equivalent products, which may differ only in the
manufacturer. To reduce the number of equivalent elements, we used the GSN codes to create a list of drugs,
each with a unique id. Since a drug and its equivalents can be
associated with several GSN codes, groups of GSN codes are established so that one group contains all
codes that have been mentioned together, directly or indirectly. Two drugs are considered equivalent if
at least one GSN code of each (not necessarily the same code) is found in the same group of GSN codes. Thus,
starting from a list of 16970 pairs (drug, gsn), we obtained a list of 3398 drugs with a unique id after
preprocessing.</p>
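      <p>Grouping codes that were mentioned together "directly or indirectly" amounts to computing connected components. A minimal sketch using a union-find structure is given below; the paper does not state which data structure was used, so this is one possible realization. The function name group_gsn_codes, the sample GSN codes and the assumption that every drug has at least one code are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def group_gsn_codes(pairs):
    """Assign one id per group of drugs linked through shared GSN codes.

    pairs: iterable of (drug_name, [gsn_code, ...]); every drug is assumed
    to have at least one code. Returns {drug_name: drug_id}.
    """
    parent = {}

    def find(code):
        parent.setdefault(code, code)
        while parent[code] != code:
            parent[code] = parent[parent[code]]  # path halving
            code = parent[code]
        return code

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    pairs = list(pairs)
    # Codes listed together for one drug end up in the same group.
    for _, codes in pairs:
        for code in codes:
            union(codes[0], code)

    drug_ids, root_to_id = {}, {}
    for name, codes in pairs:
        root = find(codes[0])
        drug_ids[name] = root_to_id.setdefault(root, len(root_to_id))
    return drug_ids


# Hypothetical (drug, gsn) pairs; the codes are made up for illustration.
pairs = [
    ("hepa", ["000001"]),
    ("hepar", ["000001", "000002"]),
    ("heparin", ["000002"]),
    ("Acetazolamide", ["000003"]),
]
drug_ids = group_gsn_codes(pairs)
print(drug_ids)
```
]]></preformat>
      <p>Here ‘hepa’ and ‘heparin’ share no code directly, but both are linked through ‘hepar’, so all three receive the same id.</p>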
    </sec>
    <sec id="sec-7">
      <title>4.2. The construction of sequences</title>
      <p>A sequence is an ordered list of events of the form <inline-formula><tex-math>\langle e_1, e_2, \ldots, e_n \rangle</tex-math></inline-formula>, where the events are nonempty
subsets of the alphabet I. In our case, the alphabet is the set of drug ids <inline-formula><tex-math>I = \{0, 1, 2, \ldots, 3397\}</tex-math></inline-formula>. A sequence
corresponds to a hospitalization and is represented by the list of ids of the drugs prescribed, grouped and
sorted by time. For example, the sequence 〈(2624), (2624), (2769, 539, 1100)〉 specifies that, during
a hospitalization, the drug with id 2624 was prescribed first, then the same drug again, followed
by a group of three drugs.</p>
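      <p>A sequence of this form can be built from time-stamped prescriptions as in the following sketch. The grouping rule (prescriptions with an identical start time form one event), the timestamp format and the function name build_sequence are assumptions made for illustration; the drug ids are those of the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from itertools import groupby


def build_sequence(prescriptions):
    """Turn (start_time, drug_id) rows of one hospitalization into a sequence.

    Rows are sorted by time; rows sharing the same start time are merged
    into one event, represented as a sorted tuple of distinct drug ids.
    """
    ordered = sorted(prescriptions)
    return [tuple(sorted({drug for _, drug in group}))
            for _, group in groupby(ordered, key=lambda row: row[0])]


# Hypothetical rows; ISO-like timestamps sort correctly as strings.
rows = [
    ("2150-01-02 08:00", 2769),
    ("2150-01-01 09:00", 2624),
    ("2150-01-01 20:00", 2624),
    ("2150-01-02 08:00", 539),
    ("2150-01-02 08:00", 1100),
]
sequence = build_sequence(rows)
print(sequence)
```
]]></preformat>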
      <p>We considered two cases for the generation of sequences: the sequences are built for all
hospitalizations, or only for a subset of hospitalizations. In the first case, the list of distinct
hospitalizations that have at least one prescription can be easily found by querying the prescriptions
table. For each element of the list, the events of the corresponding sequence are considered. Given that
there are 452115 distinct hospitalizations, the number of generated sequences is high, which limits
the applicability of the mining algorithms. Consequently, for the second case, we considered filtering the
hospitalizations by one or more diagnoses. Given a set of keywords, we search for
hospitalizations whose diagnoses, taken together, contain all the keywords. For example, for the words ‘heart’
and ‘pneumonia’, a hospitalization with the diagnoses ‘Pneumonia due to adenovirus’,
‘Aneurysm of heart’ and ‘Other and unspecified hyperlipidemia’ will be selected. In addition to this
filtering, when constructing sequences, only prescriptions with a drug type equal to MAIN are
considered.</p>
      <p>This filtering is meant to facilitate the use of fewer resources (time and memory) by algorithms and to
obtain better results, because the selection of hospitalizations by diagnoses can increase the chance of
finding more common patterns.</p>
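      <p>The keyword-based selection can be sketched as follows. The sketch assumes case-insensitive substring matching and that each keyword may be satisfied by a different diagnosis of the same hospitalization, as in the ‘heart’ and ‘pneumonia’ example; the function names and the hospitalization ids are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def matches_keywords(diagnoses, keywords):
    """True if every keyword occurs in at least one diagnosis title."""
    titles = [title.lower() for title in diagnoses]
    return all(any(keyword.lower() in title for title in titles)
               for keyword in keywords)


def filter_hospitalizations(diagnoses_by_hadm, keywords):
    """Return the ids of the hospitalizations whose diagnoses cover all keywords."""
    return [hadm_id for hadm_id, diagnoses in diagnoses_by_hadm.items()
            if matches_keywords(diagnoses, keywords)]


# Hypothetical hospitalization ids mapped to their diagnosis titles.
diagnoses_by_hadm = {
    101: ["Pneumonia due to adenovirus", "Aneurysm of heart",
          "Other and unspecified hyperlipidemia"],
    102: ["Aneurysm of heart"],
}
selected = filter_hospitalizations(diagnoses_by_hadm, ["heart", "pneumonia"])
print(selected)
```
]]></preformat>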
    </sec>
    <sec id="sec-8">
      <title>4.3. Sequence pattern mining</title>
      <p>The frequent sequences of prescribed drugs were extracted using two sequential pattern mining
algorithms, SPADE and PrefixSpan, available in the open-source Java library SPMF [16]. We ran the
algorithms on a machine running Windows 10 Pro, with an Intel(R) Core(TM) i7-8550U CPU
@ 1.80GHz and 8 GiB of memory.</p>
      <p>SPADE cannot be applied to the entire dataset because of the additional memory the algorithm
requires to transform the sequence database into the vertical format. SPADE is suitable for a
subset of hospitalizations. The results of SPADE are given in Table 2.</p>
      <p>We considered six use cases, i.e., hospitalizations that had the following diagnoses: Heart failure,
Born in hospital, Acute kidney failure, Need for prophylactic vaccination and inoculation against viral
hepatitis, Circumcision, and Encounter for immunization. We selected the hospitalizations for a
diagnosis based on a set of terms. The chosen terms are contained in, or coincide with, the names of the most
common diagnoses. We consider diagnoses with seq_num ≤ 5 because of the higher chance that they
will be the main reason for the hospitalization. For example, there are 73 different diagnoses that contain
the term 'heart failure' and for which this diagnosis is important (seq_num ≤ 5). One of the most
common diagnoses is Congestive heart failure, unspecified, according to Table 1. The number of
resulting sequences is 52086, with an average of 21.04 events per sequence. A total of 80983
frequent sequences were found by applying the algorithm to these sequences. The selected values for
the minimum support are specified in Table 2. The value of minsup is chosen empirically to keep the
running time within practical limits. A lower number of events means that fewer medications are prescribed for those
hospitalizations, which allows the choice of a lower minimum support.</p>
      <p>PrefixSpan. The parameters of the algorithm are the value of the minimum support and, optionally,
the maximum length of the sequences.</p>
      <p>We tested the algorithm on all hospitalizations, using a minimum support of 0.025 (10891
sequences) and a maximum sequence length of 20. The number of frequent sequences found is
9771. Some of the most frequently used drugs are: lactated ringers (Ringer’s Lactate
Solution), hydralazine (Hydralazine), tylenol (Paracetamol), 0.9% sodium chloride, potassium
chloride, heparin flush, etc.</p>
      <p>A small decrease in the minimum support can significantly increase the number of sequences and
thus the execution time. For example, for a minimum support of 0.02 (8713 sequences), the number
of frequent sequences increases to 20968.</p>
      <p>Next, we repeated the tests made with SPADE, but using the PrefixSpan algorithm instead.</p>
      <sec id="sec-8-1">
        <p>The results are given in Table 3.</p>
        <p>[Table fragment. Frequent sequences / Memory (MB): 101517, 44601, 86110, 99975, 268516, 93614. Time (s): 319, 29.59, 726.23, 0.77, 1.68, 6.4.]</p>
      </sec>
      <sec id="sec-8-2">
        <title>The results of PrefixSpan</title>
        <p>The execution time is significantly shorter for SPADE when we have long sequences, and shorter for
PrefixSpan in the case of short sequences.</p>
        <p>We next analyzed the frequent sequences that resulted from the application of the algorithms. We
considered two cases: with a high minimum support (for example, hospitalizations with heart failure
diagnosis) and with a low minimum support (for example, hospitalizations with Need for prophylactic
vaccinationand inoculation against viral hepatitis diagnosis).</p>
        <p>Some frequent sequences found for hospitalizations with Need for prophylactic vaccination and
inoculation against viral hepatitis diagnosis are given in Table 4.</p>
      </sec>
      <sec id="sec-8-3">
        <title>Sequential patterns</title>
        <list list-type="simple">
          <list-item><p>‘erythromycin ophthalmic’</p></list-item>
          <list-item><p>‘erythromycin ophthalmic’ ‘phytonadione (vitamin k1)’ → ‘erythromycin ophthalmic’ ‘hepatitis b vaccine’ ‘phytonadione (vitamin k1)’</p></list-item>
          <list-item><p>‘triple dye’ ‘erythromycin ophthalmic’ ‘hepatitis b immune globulin’ ‘phytonadione (vitamin k1)’</p></list-item>
          <list-item><p>‘hepatitis b vaccine’</p></list-item>
          <list-item><p>‘phytonadione (vitamin k1)’ → ‘gentamicin’</p></list-item>
          <list-item><p>‘phytonadione (vitamin k1)’ → ‘acetaminophen’</p></list-item>
          <list-item><p>‘lidocaine’ ‘acetaminophen’ → ‘hepatitis b vaccine’</p></list-item>
          <list-item><p>‘triple dye’ → ‘hepatitis b vaccine’</p></list-item>
          <list-item><p>‘triple dye’ ‘erythromycin ophthalmic’ ‘hepatitis b immune globulin’</p></list-item>
          <list-item><p>‘erythromycin ophthalmic’ ‘phytonadione (vitamin k1)’ → ‘phytonadione (vitamin k1)’</p></list-item>
        </list>
        <p>We analyzed the sequential patterns in order to identify the most commonly used medications. For
heart failure diagnoses, the most common drugs among the frequent sequences are: tylenol, senna
laxative, aspirin, docusate sodium, dextrose, furosemide, metoprolol tartrate, glucagon, etc. Most of
these drugs are also common across all hospitalizations, making it difficult to say whether they are
specific to these types of hospitalizations or not. Consequently, we manually searched for drugs
known to be common in the treatment of heart failure<sup>2</sup>. Some of the results are given next:</p>
        <list list-type="bullet">
          <list-item><p>from the class of Angiotensin-Converting Enzyme (ACE) Inhibitors: lisinopril is often found, being
contained in over 400 sequences; captopril is found in 6 sequences, with support in the range
[1000, 2200]</p></list-item>
          <list-item><p>from the class of Beta Blockers: carvedilol appears in over 100 sequences; metoprolol is one of the
most frequently found drugs</p></list-item>
          <list-item><p>from the class of Vasodilators: hydralazine is found in many different forms; nitroglycerin
is found in over 500 sequences</p></list-item>
          <list-item><p>from the class of Diuretics: furosemide is one of the most common drugs; torsemide is found in
over 500 sequences; metolazone is found only individually</p></list-item>
        </list>
        <p>For patients diagnosed with Need for prophylactic vaccination and inoculation against viral
hepatitis, the drugs prescribed are less varied, most of them being hepatitis b immune globulin (bayhep
b), hepatitis b vaccine, vitamin k, gentamicin, erythromycin ophthalmic, tylenol, heparin, triple dye.
In addition to the vaccine itself (current diagnosis indicates the need for hepatitis vaccination), usual
drugs are found, or drugs specific to newborns, because the hepatitis B vaccine is administered to
them immediately after birth.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>4.4. Make predictions using frequent sequences</title>
      <p>As previously specified, the frequent sequences can be used to identify the drugs used for
different diagnoses. However, when the number of sequences is huge, this approach becomes less relevant
and time-consuming. Frequent drug sequences can reveal which drugs or combinations of drugs are
more likely to be recommended when the previous prescriptions are known. We predict the most
likely drugs to be prescribed and compare the result with the real values to determine the accuracy
of the predictions.</p>
      <p>Rules construction. To describe the links between the drugs in frequent sequences, we
generate rules of the form (antecedent, consequent, support), with the following meanings:</p>
      <sec id="sec-9-1">
        <p><fn><label>2</label><p>https://www.nhs.uk/conditions/heart-failure/treatment/</p></fn></p>
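        <list list-type="bullet">
          <list-item><p>For a sequence <inline-formula><tex-math>s = \langle e_1, e_2, \ldots, e_n \rangle</tex-math></inline-formula>, the antecedent contains the first (n−1) events
<inline-formula><tex-math>\langle e_1, e_2, \ldots, e_{n-1} \rangle</tex-math></inline-formula>, and the consequent is the last event <inline-formula><tex-math>e_n</tex-math></inline-formula>. Only sequences containing at least two
events are considered.</p></list-item>
          <list-item><p>The support of the rule corresponds to the support of the sequence.</p></list-item>
        </list>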
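        <p>The construction of rules can be sketched in Python as follows. Events are represented as frozensets and a sequence as a tuple of events; this representation, the function name build_rules and the support values in the example are assumptions made for illustration only.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def build_rules(frequent_sequences):
    """Turn frequent sequences into (antecedent, consequent, support) rules.

    frequent_sequences: list of (sequence, support), where a sequence is a
    tuple of events. Sequences with fewer than two events yield no rule.
    """
    rules = []
    for sequence, support in frequent_sequences:
        if len(sequence) < 2:
            continue
        antecedent, consequent = sequence[:-1], sequence[-1]
        rules.append((antecedent, consequent, support))
    return rules


def E(*items):
    """Shorthand for building an event."""
    return frozenset(items)


# Patterns of the kind listed in the paper; the supports are made up.
frequent_sequences = [
    ((E("erythromycin ophthalmic"),), 900),
    ((E("triple dye"), E("hepatitis b vaccine")), 300),
    ((E("lidocaine", "acetaminophen"), E("hepatitis b vaccine")), 120),
]
rules = build_rules(frequent_sequences)
for antecedent, consequent, support in rules:
    print([sorted(e) for e in antecedent], "->", sorted(consequent), support)
```
]]></preformat>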
        <p>Some examples of rules generated from frequent sequences for the Need for prophylactic vaccination
and inoculation against viral hepatitis diagnosis are given in Table 5.</p>
        <table-wrap>
          <label>Table 5</label>
          <table>
            <thead>
              <tr><th>Antecedent</th><th>Consequent</th></tr>
            </thead>
            <tbody>
              <tr><td>{‘erythromycin ophthalmic’, ‘phytonadione (vitamin k1)’}</td><td>{‘erythromycin ophthalmic’, ‘hepatitis b vaccine’, ‘phytonadione (vitamin k1)’}</td></tr>
              <tr><td>{‘phytonadione (vitamin k1)’}</td><td>{‘gentamicin’}</td></tr>
              <tr><td>{‘phytonadione (vitamin k1)’}</td><td>{‘acetaminophen’}</td></tr>
              <tr><td>{‘lidocaine’, ‘acetaminophen’}</td><td>{‘hepatitis b vaccine’}</td></tr>
              <tr><td>{‘triple dye’}</td><td>{‘hepatitis b vaccine’}</td></tr>
              <tr><td>{‘erythromycin ophthalmic’, ‘phytonadione (vitamin k1)’}</td><td>{‘phytonadione (vitamin k1)’}</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Predictions using rules. Before making any predictions, the list of rules is sorted using a
multilevel approach: first in descending order of the number of events in the antecedent, and then in descending
order of support. To narrow the search space, we also created a threshold dictionary as follows: for each
antecedent length that exists in the sorted list, store the index of the first
corresponding rule. For example, the threshold dictionary
<inline-formula><tex-math>\mathit{thresh} = \{8: 0,\ 7: 97,\ 6: 1178,\ 5: 6174,\ 4: 17020,\ 3: 29901,\ 2: 39094,\ 1: 43006,\ 0: 43908\}</tex-math></inline-formula> reveals
that there are eight distinct lengths of the rules’ antecedent; the entry for key 0 marks the end of the list. The rules whose antecedent
contains x events, <inline-formula><tex-math>1 \le x \le 8</tex-math></inline-formula>, are found in the list starting at position <inline-formula><tex-math>\mathit{thresh}[x]</tex-math></inline-formula> and ending
at position <inline-formula><tex-math>\mathit{thresh}[x-1] - 1</tex-math></inline-formula>.</p>
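        <p>The sorting and the threshold dictionary can be sketched as follows. The function names and the sample rules are hypothetical. As a small generalization that is not stated in the paper, the end of a block is taken from the next smaller antecedent length that actually occurs, so the lookup also works when some lengths are missing.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def sort_rules(rules):
    """Sort by antecedent length (descending), then by support (descending)."""
    return sorted(rules, key=lambda rule: (-len(rule[0]), -rule[2]))


def build_thresholds(sorted_rules):
    """Map each antecedent length to the index of its first rule.

    Key 0 stores the total number of rules and acts as an end marker.
    """
    thresholds = {}
    for index, (antecedent, _, _) in enumerate(sorted_rules):
        thresholds.setdefault(len(antecedent), index)
    thresholds[0] = len(sorted_rules)
    return thresholds


def rules_with_length(sorted_rules, thresholds, length):
    """Return the block of rules whose antecedent has exactly `length` events."""
    if length < 1 or length not in thresholds:
        return []
    end = thresholds[max(key for key in thresholds if key < length)]
    return sorted_rules[thresholds[length]:end]


def E(*items):
    """Shorthand for building an event."""
    return frozenset(items)


# Hypothetical rules: (antecedent, consequent, support).
rules = [
    ((E("a"),), E("b"), 5),
    ((E("a"), E("b")), E("c"), 3),
    ((E("c"),), E("a"), 9),
    ((E("a"), E("c")), E("b"), 4),
]
sorted_rules = sort_rules(rules)
thresholds = build_thresholds(sorted_rules)
print(thresholds)
```
]]></preformat>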
        <p>Given a patient’s prescribed medication sequence during a hospitalization, <inline-formula><tex-math>s = \langle e_1, e_2, \ldots, e_n \rangle</tex-math></inline-formula>,
and a sorted list of rules, the predictions are made as follows:</p>
        <list list-type="order">
          <list-item><p>If <inline-formula><tex-math>n \ge 1</tex-math></inline-formula>, iterate through the rules whose antecedent has the same number of events as s.
The threshold dictionary is used to locate them.</p></list-item>
          <list-item><p>For each rule, check whether there is a match between the antecedent and the sequence s. If a match is
found, the event from the consequent is added to a list.</p></list-item>
          <list-item><p>If five matches are found, the search ends. Otherwise, the first event of the sequence s is removed
and the previous steps are repeated. Deleting the first event of s means, in fact, that we
test on the patient’s more recent history.</p></list-item>
        </list>
        <p>At the end, the list of at most five events represents the predicted drugs for the patient with
the sequence s as history.</p>
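        <p>A minimal sketch of this procedure is given below. The paper does not define precisely when an antecedent matches the sequence; the sketch assumes that both have the same number of events and that each antecedent event is contained in the corresponding event of the history. For brevity it scans the whole sorted list instead of using the threshold dictionary, which only narrows the scan. All names and sample rules are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def matches(antecedent, history):
    """Assumed matching rule: equal length and event-wise containment."""
    return (len(antecedent) == len(history)
            and all(a <= h for a, h in zip(antecedent, history)))


def predict(history, sorted_rules, max_predictions=5):
    """Collect up to `max_predictions` consequents for the given history."""
    predictions = []
    history = list(history)
    while history and len(predictions) < max_predictions:
        for antecedent, consequent, _ in sorted_rules:
            if matches(antecedent, history):
                predictions.append(consequent)
                if len(predictions) == max_predictions:
                    break
        # Drop the oldest event and retry on the more recent history.
        history = history[1:]
    return predictions


def E(*items):
    """Shorthand for building an event."""
    return frozenset(items)


# Hypothetical rules, already sorted by antecedent length, then support.
sorted_rules = [
    ((E("a"), E("b")), E("c"), 4),
    ((E("b"),), E("d"), 7),
    ((E("b"),), E("e"), 2),
]
history = [E("a"), E("b", "x")]
predictions = predict(history, sorted_rules)
print([sorted(event) for event in predictions])
```
]]></preformat>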
        <p>To test the accuracy of the predictions, we used the hospitalizations on which the frequent sequences
were found and which, implicitly, were used to generate the rules and the threshold dictionary.
Denote by <inline-formula><tex-math>l_{\max}</tex-math></inline-formula> the maximum value of a key in the threshold dictionary, i.e., the maximum length of
an antecedent for the current rules. The sequence of each hospitalization is divided into segments
of length <inline-formula><tex-math>l_{\max}</tex-math></inline-formula>. If the sequence cannot be divided exactly, the last segment is considered only if its length is
at least two. The last event is removed from each segment, as it is used to verify the correctness
of the predictions. Predictions are made based on these segments, and if at least one of the drugs
contained in the predicted events is found in the event set aside, the prediction is considered
correct. Accuracy is computed as the percentage of correct predictions out of the total predictions
made.</p>
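        <p>The evaluation protocol can be sketched as follows. The predictor is passed in as a function, so any prediction procedure can be plugged in; the names split_into_segments and accuracy, the stand-in predictor and the toy sequence are hypothetical. Segments for which no prediction is found are counted in the denominator here, which is one of the two accuracy variants reported in the paper.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def split_into_segments(sequence, segment_length):
    """Cut a sequence into consecutive segments; keep those with >= 2 events."""
    segments = [sequence[i:i + segment_length]
                for i in range(0, len(sequence), segment_length)]
    return [segment for segment in segments if len(segment) >= 2]


def accuracy(sequences, segment_length, predictor):
    """Percentage of segments whose held-out last event shares a drug
    with at least one predicted event."""
    correct = total = 0
    for sequence in sequences:
        for segment in split_into_segments(sequence, segment_length):
            history, target = segment[:-1], segment[-1]
            predicted = predictor(history)
            total += 1
            if any(event & target for event in predicted):
                correct += 1
    return 100.0 * correct / total if total else 0.0


def E(*items):
    """Shorthand for building an event."""
    return frozenset(items)


# Stand-in predictor that always suggests drug "c" (for illustration only).
def always_c(history):
    return [E("c")]


sequences = [[E("a"), E("b"), E("c"), E("a"), E("d")]]
score = accuracy(sequences, segment_length=3, predictor=always_c)
print(score)
```
]]></preformat>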
      </sec>
      <sec id="sec-9-2">
        <title>Predictions results</title>
        <p>The prediction results are given in Table 6.</p>
        <sec id="sec-9-2-3">
          <title>PrefixSpan</title>
          <p>For the heart failure diagnosis, for example, for 5718 sequences at least one correct prediction was
obtained, meaning an accuracy of 25.83%, and for 4704 we could not find any prediction. If we take into
account only the sequences on which predictions were found, then the accuracy would be 32.79%.</p>
          <p>For certain diagnoses, like Need for prophylactic vaccination and inoculation against viral hepatitis
and Circumcision, the accuracy is high, while for other diagnoses, like Heart failure, it is low. Statistically,
the number of prescriptions increases with age [17]. Intuitively, a diagnosis that contains the term ‘born in
hospital’ refers to newborns, in which case certain standard medicines are required. The number of allowed
drugs is lower (many drugs have age restrictions). In this case, it is easier to identify which drugs are
more likely to be prescribed. The diagnoses Need for prophylactic vaccination and inoculation against
viral hepatitis and Encounter for immunization indicate that a person needs the administration of a
vaccine. The person is not necessarily ill, so the number of drugs is not expected to be high. In contrast,
diagnoses that contain ‘heart failure’ indicate a serious, complex condition that is often found in the
population over the age of 65.</p>
          <p>To better clarify the possible reasons that affect the accuracy of predictions, we analysed other
measures detailed in Table 7. The second column contains the total number of different drugs
encountered in the sequences. The next column contains the average number of drugs per sequence.
The last column contains the average difference between the date of the last prescription and the date
of the first prescription.</p>
          <p>According to Table 7, when there is a wider range of drugs to choose from, the accuracy tends to
decrease. The average number of drugs per sequence influences the sequential pattern mining algorithms:
a larger support usually has to be chosen so as not to use too much memory, which also
influences the accuracy. Another parameter that could influence the results is the length of the period
in which prescriptions were made. This may indicate complex diagnoses or, conversely, less severe
cases.</p>
          <p>The choice of the minimum support can influence the accuracy of the predictions, and indirectly
the runtime and the memory. Table 8 exemplifies the way the support influences the accuracy. The
last column is the time needed to compute the predictions.</p>
          <p>As the minimum support decreases, the accuracy increases slightly, and the runtime also increases.
Lowering the support is useful up to a certain limit, for which a reasonable execution time is obtained.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>5. Conclusions</title>
      <p>Sequential pattern mining is an effective technique for predicting medications
based on a patient’s past prescription history. This paper studies in particular the application of
two algorithms, SPADE and PrefixSpan, as a means to find frequent sequences that reveal temporal
relationships between medications. The resulting frequent sequences are general or specific to one or
more diseases and are used to construct rules. Predictions are made by matching a patient’s
medication history against the list of rules. According to the experimental results, there are situations in
which the predictions reach a satisfactory accuracy. Such a solution is especially useful for
routine cases, such as immunizations or the treatment of newborns. For more complex
diagnoses, however, additional study is needed to improve the results.</p>
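      <p>The rule-matching step summarized above can be sketched as follows. This is a hypothetical Python illustration, not the implementation evaluated in the paper: the rule format (antecedent, predicted drug, confidence), the example rules and the tie-breaking strategy, which prefers higher confidence and then the longer antecedent, are all assumptions made for the example.</p>
      <preformat>
```python
# Hypothetical rules: (antecedent sequence, predicted next drug, confidence).
# The drugs and confidence values are invented for illustration.
rules = [
    (("aspirin",), "metoprolol", 0.40),
    (("aspirin", "furosemide"), "heparin", 0.65),
    (("insulin",), "glucose", 0.55),
]


def is_subsequence(pattern, history):
    """True if the items of pattern occur in history in the same order."""
    remaining = iter(history)
    # Each membership test consumes the iterator up to the matched item,
    # so later items of the pattern must appear after earlier ones.
    return all(item in remaining for item in pattern)


def predict_next(history, rules):
    """Return the drug predicted by the best matching rule, or None."""
    matches = [r for r in rules if is_subsequence(r[0], history)]
    if not matches:
        return None
    # Assumed ranking: higher confidence first, then longer antecedent.
    best = max(matches, key=lambda r: (r[2], len(r[0])))
    return best[1]


print(predict_next(["aspirin", "lisinopril", "furosemide"], rules))
```
      </preformat>
      <p>In this toy setting, the history aspirin, lisinopril, furosemide matches the first two rules, and the second one is selected because of its higher confidence. A history that matches no rule yields no prediction.</p>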
      <p>Possible improvements include the use of supplementary patient
information, such as laboratory results and age, and of supplementary medication details, such as the
dose and the method of administration.</p>
    </sec>
    <sec id="sec-11">
      <title>6. References</title>
      <p>[15] Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H., Chen, Q., &amp; Hsu, M. C. Mining sequential
patterns by pattern-growth: The PrefixSpan approach. IEEE Transactions on Knowledge and Data
Engineering, 16(11), 1424-1440 (2004). doi: 10.1109/TKDE.2004.77</p>
      <p>[16] Fournier-Viger, P., Lin, J. C. W., Gomariz, A., Gueniche, T., Soltani, A., Deng, Z., &amp; Lam, H. T. The
SPMF open-source data mining library version 2. In Joint European conference on machine learning
and knowledge discovery in databases (pp. 36-40) (2016).</p>
      <p>[17] Martin, C. B., Hales, C. M., Gu, Q., &amp; Ogden, C. L. Prescription drug use in the United States,
2015–2016 (2019): 1-8.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Schmiedl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rottenkolber</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasford</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rottenkolber</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farker</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Drewelow</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Thürmann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>Self-medication with over-the-counter and prescribed drugs causing adverse-drug-reaction-related hospital admissions: results of a prospective, long-term multi-centre study</article-title>
          .
          <source>Drug safety</source>
          (
          <year>2014</year>
          ):
          <volume>37</volume>
          (
          <issue>4</issue>
          ):
          <fpage>225</fpage>
          -
          <lpage>235</lpage>
          . doi: 10.1007/s40264-014-0141-3.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Prather</surname>
            ,
            <given-names>J. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lobach</surname>
            ,
            <given-names>D. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goodwin</surname>
            ,
            <given-names>L. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hales</surname>
            ,
            <given-names>J. W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hage</surname>
            ,
            <given-names>M. L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hammond</surname>
            ,
            <given-names>W. E.</given-names>
          </string-name>
          <article-title>Medical data mining: knowledge discovery in a clinical data warehouse</article-title>
          .
          <source>Proceedings: a conference of the American Medical Informatics Association. AMIA Fall Symposium</source>
          (
          <year>1997</year>
          ):
          <fpage>101</fpage>
          -
          <lpage>105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Srikant</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <article-title>Mining sequential patterns</article-title>
          .
          <source>Proceedings of the Eleventh International Conference on Data Engineering</source>
          (
          <year>1995</year>
          ):
          <fpage>3</fpage>
          -
          <lpage>14</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Fournier-Viger</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J. C. W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiran</surname>
            ,
            <given-names>R. U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koh</surname>
            ,
            <given-names>Y. S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <article-title>A survey of sequential pattern mining</article-title>
          .
          <source>Data Science and Pattern Recognition</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <fpage>54</fpage>
          -
          <lpage>77</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Norén</surname>
            ,
            <given-names>G. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bate</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hopstadius</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Star</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Edwards</surname>
            ,
            <given-names>I. R.</given-names>
          </string-name>
          <article-title>Temporal pattern discovery for trends and transient effects: its application to patient records</article-title>
          .
          <source>In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          (
          <year>2008</year>
          ):
          <fpage>963</fpage>
          -
          <lpage>971</lpage>
          . doi: 10.1145/1401890.1402005.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Reps</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garibaldi</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aickelin</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soria</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gibson</surname>
            ,
            <given-names>J. E.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hubbard</surname>
            ,
            <given-names>R. B.</given-names>
          </string-name>
          <article-title>Discovering sequential patterns in a UK general practice database</article-title>
          .
          <source>In Proceedings of 2012 IEEE-EMBS International Conference on Biomedical and Health Informatics</source>
          (
          <year>2012</year>
          ):
          <fpage>960</fpage>
          -
          <lpage>963</lpage>
          . doi: 10.1109/BHI.2012.6211748.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Wright</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wright</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCoy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sittig</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The use of sequential pattern mining to predict next prescribed medications</article-title>
          .
          <source>Journal of biomedical informatics</source>
          <volume>53</volume>
          (
          <year>2015</year>
          ):
          <fpage>73</fpage>
          -
          <lpage>80</lpage>
          . doi: 10.1016/j.jbi.2014.09.003.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Helgason</surname>
            ,
            <given-names>Í. S.</given-names>
          </string-name>
          <article-title>Predicting prescription patterns</article-title>
          (Doctoral dissertation, Massachusetts Institute of Technology) (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Thibault</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lebel</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>An application of machine learning to assist medication order review by pharmacists in a health care center</article-title>
          . (
          <year>2019</year>
          ). https://doi.org/10.1101/19013029.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tong</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>A treatment engine by predicting next-period prescriptions</article-title>
          .
          <source>In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          , (
          <year>2018</year>
          ):
          <fpage>1608</fpage>
          -
          <lpage>1616</lpage>
          . doi: 10.1145/3219819.3220095.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldstein</surname>
            ,
            <given-names>M. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asch</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mackey</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Altman</surname>
            ,
            <given-names>R. B.</given-names>
          </string-name>
          <article-title>Predicting inpatient clinical order patterns with probabilistic topic models vs conventional order sets</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          ,
          <volume>24</volume>
          (
          <issue>3</issue>
          ),
          <fpage>472</fpage>
          -
          <lpage>480</lpage>
          (
          <year>2017</year>
          ). doi: 10.1093/jamia/ocw136
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bulgarelli</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pollard</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horng</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Celi</surname>
            ,
            <given-names>L. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mark</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <article-title>MIMIC-IV (version 1.0)</article-title>
          .
          <source>PhysioNet</source>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Mooney</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roddick</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          <article-title>Sequential pattern mining-approaches and algorithms</article-title>
          .
          <source>ACM Computing Surveys (CSUR)</source>
          ,
          <volume>45</volume>
          (
          <issue>2</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>39</lpage>
          (
          <year>2013</year>
          ). doi: 10.1145/2431211.2431218.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Zaki</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          <article-title>SPADE: An efficient algorithm for mining frequent sequences</article-title>
          .
          <source>Machine learning</source>
          ,
          <volume>42</volume>
          (
          <issue>1</issue>
          ),
          <fpage>31</fpage>
          -
          <lpage>60</lpage>
          (
          <year>2001</year>
          ). doi: 10.1023/A:1007652502315.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>