seqClustR: An R Package for Sequence Clustering

                                                              Aditya Sharma
                                                                Playpower Labs
                                                                    Gujarat
                                                    aditya@playpowerlabs.com


ABSTRACT
In this paper, we’re going to describe the core features of
the R package seqClustR [14] dedicated to sequence clus-
tering. Sequence clustering is a data mining technique that
groups similar sequences into clusters based on their similar-
ities. Sequence clustering is useful when there are unknown
number of similar sequences that need to be identified to
gain valuable insights. The main feature of this package is
that it provides easy access to different algorithms such as
Edit Distance with Hierarchical Clustering, Markov Model-
Based Clustering, Dynamic Time Warping, and K-Means to
perform sequence clustering. We find that different algo-
rithms can create very different clusters and lead us to very
different conclusions. So to get a reliable understanding of
the sequences, we need to apply various sequence cluster-
ing algorithms and explore the data from multiple points of
view. This paper illustrates how you could create different
clusters from different algorithms, extract event log data for
each cluster and visualize them. We have provided an ex-
ample in section 4 showing step by step procedure to run
sequence clustering on the National Assessment of Educa-
tional Progress (NAEP) dataset.

Keywords
Sequence Clustering, Sequence Data, Event Logs, Visualiza-
tion

1.    INTRODUCTION
In this paper, we’ll present the seqClustR package, which
implements different clustering algorithms on sequence data
in R. Sequence data contains information about the differ-
ent activities performed over time. Working with sequence
data can be hard sometimes; but it can probably provide
better insights by making use of the temporal dimension
[1]. When sequence data contains a lot of activities which
lead to different types of sequences, the process model on
the complete data can become too complex to interpret. To
get better insights from the sequence data, we can cluster


                                                                          Figure 1: Process model on complete data Vs clustered data


Copyright ©2021 for this paper by its authors. Use permitted under Cre-
ative Commons License Attribution 4.0 International (CC BY 4.0)
similar sequences together and analyse them; sequence clus-       This package is designed to cluster sequence data. The pack-
tering can provide better explainable models than a general       age uses event log [3] as an input for the data. Event log
model fitted on complete data [2].                                are very commonly used to store the user behavior data.
                                                                  They indicate the sequence of actions a user takes over time,
In process mining, sequence clustering plays an important         along with added metadata of the event. For process analy-
role by grouping similar sequences and providing helpful in-      sis, event log is an essential data format. Event log contain
sights, especially when we have little knowledge about var-       three major attributes: case, activity, and timestamp. A
ious types of processes hidden in the data. For example,          case can be defined as a sequence of activities performed
a business process data might have different versions of a        over time, and an event log represents one or more cases.
single process within it. Sequence clustering is used in dif-
ferent domains: in bioinformatics to group similar biologi-       Once we have the data in event log format, we can perform
cal sequences; in marketing to identify different purchasing      sequence clustering by passing it into one of the sequence
behaviors of customers; in education to identify different        clustering functions. The output of the function would be a
learner behaviors. In education, while using digital con-         list containing the fitted model and a data frame having the
tent learners exhibit different types of behaviors: gaming        case to cluster assigned mapping. To run analysis on individ-
the system behavior; off-task behavior; carelessness behav-       ual clusters, we need event logs for each cluster for which we
ior. Performing sequence clustering on educational sequence       have written a function split_event_log which takes event
data can help us group similar learners together and then we      logs and the clusters assigned data frame as inputs, and
can compare their performance to understand which learner         returns a list of event log by cluster. To visually observe
behavior models have a positive impact on learners’ growth.       the differences between sequence clusters and comparing
                                                                  different clustering algorithms we’ve used the fuzzymineR
                                                                  package [11] as it is readily available in R. There are some
2.   SEQUENCE CLUSTERING IN EDUCA-                                other tools available as well for visualizing sequence data,
     TION                                                         like ProM and Disco.
In recent years, a lot of educational data has been col-
lected through digital learning platforms at an increasing        3.1     Clustering Algorithms
rate. With more learners joining the digital platforms and
more types of content being created for them, we are observ-      3.1.1    Edit Distance Clustering
ing a lot of clickstream data. Clickstream data shows us the      (seq_edit_distance_clustering) Edit Distance between two
sequence of activities that learners went through while be-       sequences can be defined as the least number of activities re-
ing on the platform. Gaining insights from such data could        quired to be added, subtracted, or substituted to convert one
help us identify different learner behaviors [16, 6]; improve     sequence into another. For example, with 1 addition we can
the platform design based on learner’s interactions on the        convert the sequence (a, b, c) into another sequence (a, b,
platform [18, 9]; and provide adaptive content to individual      c, d), so the Edit Distance between them is 1.
learners to better support their learning [7, 8]. To gain in-
sights from sequence data a few techniques have been used         For clustering the sequence data with the help of Edit Dis-
in the education data mining community such as Associa-           tance, we chose Hierarchical Clustering Algorithm. To make
tion Rule Mining [5], Sequential Pattern Mining [19], Pro-        it convenient to choose the number of clusters, we plot the
cess Mining [17], Graph-Based Analysis [10], and Curricu-         Hierarchical Clustering tree for the users so that they can
lum Pacing [12].                                                  make an informed decision about it.

Understanding a learner behavior from process models using        3.1.2    Markov Model Based Clustering
sequence data can be complex with high-dimensional data.          (seq_markov_clustering) Markov Model-based clustering
The sequence data can contain a high number of distinct           is a probabilistic model-based approach to cluster sequence
activities and all the activities need not be equally important   data. The Markov Model-based clustering method is similar
or have enough learner data; we could perform exploratory         to K-Means, where the cluster centroids in Markov Model
data analysis on the data to understand which activities are      are Markov transition probability matrices and the data
important. Activities that don’t provide much information         points are Markov transition probability matrices for the
can make the models more complex and harder to interpret          sequences. An entry Aab in a Markov transition probability
[4]. We should try to reduce the complexity of the sequence       matrix A can be interpreted as the conditional probability
data by preprocessing the data before doing any sequence          that a learner will do activity b after doing the activity a,
data analysis. To illustrate this by an example, Figure 1         independent of any previous activities.
shows three different process maps of learner action sequence
data from the NAEP assessment. The complex process map
at the top is made from all of the sequences in the dataset
                                                                   3.1.3 Dynamic Time Warping (seq_dtw_clustering)
                                                                  Dynamic Time Warping is a distance-based clustering al-
(N = 1009), and the two process maps below it are made
                                                                  gorithm that measures similarity between two temporal se-
from two distinct clusters of sequences (N1 = 726, N2 =
                                                                  quences, which can vary in speed. In the context of edu-
283). We can observe that the complexity of the process
                                                                  cational data, we can say that two learners that perform
maps of clustered sequences is lower than the process map
                                                                  a sequence of activities in the same order but vary in the
of complete data; it is much easier now to analyse the process
                                                                  number of times each activity was done would have 0 Dy-
map and make hypotheses about learner behavior.
                                                                  namic Time Warping distance. For example, two learners
                                                                  with sequences (a, b, c) and (a, b, b, c, c) would have 0 Dy-
3.   seqClustR FRAMEWORK                                          namic Time Warping distance. Dynamic Time Warping can
be used to analyze any data that contain a sequence of ac-
tions over time, it has also been used to cluster educational
sequence data previously [15]. We’ve used Dynamic Time
Warping clustering algorithm from the R package Dynamic
Time Warping [13] in our seqClustR package.


                                                                           10000 20000
3.1.4        K-Means Clustering
(seq_kmeans_clustering) K-Means Clustering Algorithm
identifies K number of centroids and allocates each data


                                                                  Height
point to the nearest cluster to minimize the within-cluster
sum of squares.

In our approach, we calculate the number of times each ac-
tivity was done for each case and then normalize it by the se-


                                                                           0
quence length. These features were then passed to K-Means
Clustering Algorithm.

4.     EXAMPLE
In this section, we’ll explain how practitioners can use the
package to perform sequence clustering on their data. We’ll
show an example of how to prepare the sequence data that
can be passed to the clustering algorithms and then how we
can visualize the clusters.                                                              Figure 2: Hierarchical Cluster Plot

4.1      Data
We’ve used NAEP’s Process Data for Math test from 2017           For clustering, we need to convert the sequence data into
which included data of 2500 learners. NAEP test is used          event log format. We have used STUDENTID as Case ID,
by the US government to measure learner knowledge across         Observable as Activity ID, and EventTime as Timestamp.
the country. This data was part of the NAEP Data Mining          For the missing attributes, we can manually add them [3]:
Competition 2019 whose goal was to find effective and inef-      Lifecycle ID was added as complete, Activity Instance ID
fective learner test-taking behaviors. The test was divided      was added as row number, and Resource was added as NA.
into two blocks of 30 min where Block A’s data was sup-
                                                                 library ( tidyverse )
posed to be used to predict learner’s performance on Block
                                                                 l i b r a r y ( bupaR )
B. There were 8 different question types but for the analysis,
we’re only working with data from Block A and consider-
                                                                 e v e n t l o g <− s e q u e n c e d a t a %>%
ing Multiple-Choice Questions as it had the majority of the
                                                                   a r r a n g e ( EventTime ) %>%
questions in the test.
                                                                   mutate ( l i f e c y c l e i d = ’ c o m p l e t e ’ ,
                                                                                 r e s o u r c e = NA,
                                             Attributes of                       row num = 1 : nrow ( . ) ) %>%
     Variable        Description
                                             Event Logs            eventlog ( case id = ”learnerID ” ,
                     Unique identifier of                                            a c t i v i t y i d = ”O b s e r v a b l e ” ,
     STUDENTID                               Case ID
                     the learner                                                     a c t i v i t y i n s t a n c e i d = ”row num” ,
                     Block of the NAEP                                               l i f e c y c l e id = ” l i f e c y c l e id ” ,
     Block                                   -
                     test, A or B                                                    timestamp = ”EventTime ” ,
     Accession       Unique       question                                           r e s o u r c e i d = ” r e s o u r c e ”)
                                             -
     Number          identifier
     ItemType        Type of Question        -
     Observable      learner Activity        Activity ID         4.2       Clustering
                     Metadata of learner’s                       After converting the sequence data into event logs format,
     ExtendedInfo                            -                   we can start applying clustering algorithms on it. As an
                     Activity
     EventTime       Event Timestamp         Timestamp           example, we’ve performed Edit Distance Based Hierarchical
                                                                 Clustering on the event log data that we prepared in Section
Table 1: Columns of the NAEP Process Data with Attributes        4.1. When we pass the event log into the respective function,
of Event Logs                                                    we are shown a Hierarchical Cluster Tree (Figure 2) in the
                                                                 graphics window, and we get a user prompt to select whether
                                                                 we would like to cut the tree by number of groups or by
Before applying any clustering algorithm to the data, we
                                                                 height. On selecting one of the options we need to enter
need to perform data preprocessing steps to reduce data
                                                                 the value corresponding to it. The output of the function
complexity. Our data preprocessing steps included remov-
                                                                 would be a list containing the fitted model and the cluster
ing activities that didn’t have enough learner events and re-
                                                                 assignments by case id.
moving events where data for any of the event log attributes
mentioned in table 1 were missing.                               l i b r a r y ( seqClustR )
                                                                       event logs we can run multiple clustering algorithms and
c l u s t e r <− s e q e d i t d i s t a n c e c l u s t e r i n g (   compare them. In this paper, we’ve shown how to prepare
                 event l o g )                                         sequence data, perform sequence clustering, and visualize
                                                                       the clusters. In future, we would like to add more cluster-
                                                                       ing algorithms and add a functionality in the package to do
4.3    Visualizing Clusters                                            qualitative comparisons of clusters as well.
Visualization is a powerful tool that can help us explore
common behavioral patterns exhibited by learners. We’ve                6.   ACKNOWLEDGMENTS
got two clusters from the event log in Section 4.2, to find out        We would like to thank the organizers of the 2019 NAEP
how both of them differ from each other based on common                Data Mining Challenge for providing the data. We hope
learner behavior patterns we’ll make visualizations for these          that the seqClustR package would support and inspire more
clusters (Figures 3 and 4). We’ve made process models for              research on sequence data.
both the clusters using fuzzymineR package [11].
                                                                       7.   REFERENCES
# Get e v e n t l o g s by c l u s t e r as a l i s t .                 [1] H.-P. Blossfeld, T. Schneider, and J. Doll.
                                                                            Methodological advantages of panel studies. designing
e v e n t l o g 2 <− s p l i t e v e n t l o g (                            the new national educational panel study (neps) in
                       eventlog ,                                           germany. Journal for Educational Research Online,
                       c l u s t e r $ c l u s t e r assignment )           1(1):10–32, 2009.
                                                                        [2] A. Bogarı́n, C. Romero, R. Cerezo, and
l i b r a r y ( fuzzymineR )                                                M. Sánchez-Santillán. Clustering for improving
                                                                            educational process mining. In Proceedings of the
# P r o c e s s Model f o r C l u s t e r 1                                 fourth international conference on learning analytics
                                                                            and knowledge, pages 11–15, 2014.
m e t r i c s <− mine f u z z y model (                                 [3] bupaR. Creating Event Logs.
                      e v e n t l o g 2 [ [ ”1 ” ] ] )                      https://www.bupar.net/creating_eventlogs.html.
                                                                        [4] A. H. Cairns, B. Gueni, M. Fhima, A. Cairns,
v i z f u z z y model ( m e t r i c s = m e t r i c s ,                     S. David, and N. Khelifa. Process mining in the
                        node s i g t h r e s h o l d = 0 . 1 ,              education domain. International Journal on Advances
                        edge s i g t h r e s h o l d = 0 . 3 ,              in Intelligent Systems, 8(1):219–232, 2015.
                        edge s i g t o c o r r r a t i o = 1 )
                                                                        [5] E. Garcı́a, C. Romero, S. Ventura, C. de Castro, and
                                                                            T. Calders. Association rule mining in learning
4.4    Discussion                                                           management systems. Handbook of educational data
Performing sequence clustering using hierarchical clustering                mining, pages 93–106, 2010.
with edit distance method on the NAEP data provided us                  [6] C. Hansen, C. Hansen, N. Hjuler, S. Alstrup, and
two clusters that exhibit different behaviors. Cluster 1 (Fig-              C. Lioma. Sequence modelling for analysing student
ure 3) shows that for 56% of the times learners directly click              interaction with educational systems. arXiv preprint
on one of the choices after they have presented a multiple                  arXiv:1708.04164, 2017.
choice question; 32% of the times when learners are pre-                [7] M. Köck and A. Paramythis. Activity sequence
sented with a multiple-choice question they first use the cal-              modelling and dynamic clustering for personalized
culator and then click on one of the choices. The different                 e-learning. User Modeling and User-Adapted
behaviors exhibited by learners in cluster 1 could be be-                   Interaction, 21(1):51–97, 2011.
cause of varying item difficulties, where an easy item might            [8] J. D. Lomas, J. Forlizzi, N. Poonwala, N. Patel,
not require the use of the calculator but for more difficult                S. Shodhan, K. Patel, K. Koedinger, and E. Brunskill.
item learners might need it.                                                Interface design optimization as a multi-armed bandit
                                                                            problem. In Proceedings of the 2016 CHI conference
In cluster 2 (Figure 4), learners show a different behavioral               on human factors in computing systems, pages
pattern, they tend to go through different steps before click-              4142–4153, 2016.
ing on one of the choices; 39% of the times when learners               [9] R. Luckin et al. Modeling learning patterns of students
are presented with a multiple-choice question they first use                with a tutoring system using hidden markov models.
the calculator but unlike cluster 1, they do some scratch-                  Artificial intelligence in education: Building technology
work before clicking on one of the choices; learners tend to                rich learning contexts that work, 158:238, 2007.
click on different choices a significant number of times be-           [10] N. Patel, C. Sellman, and D. Lomas. Mining frequent
fore submitting their final answer and moving on to the next                learning pathways from a large educational dataset.
item; learners also use the eliminate answer choice tool sig-               arXiv preprint arXiv:1705.11125, 2017.
nificantly in cluster 2.                                               [11] N. Patel and T. Shah. fuzzymineR.
                                                                            https://github.com/nirmalpatel/fuzzymineR.
5.    CONCLUSION                                                       [12] N. Patel, A. Sharma, C. Sellman, and D. Lomas.
The R package seqClustR provides the means to perform                       Curriculum pacing: A new approach to discover
different clustering algorithms on sequence data by reduc-                  instructional practices in classrooms. In International
ing the complexity of preparing data for each algorithm in                  Conference on Intelligent Tutoring Systems, pages
a different way, by just converting the sequence data into                  345–351. Springer, 2018.
[13] A. Sarda-Espinosa. dtwclust: Time Series Clustering
     Along with Optimization for the Dynamic Time
     Warping Distance, 2019. R package version 5.5.6.
[14] A. Sharma and N. Patel. seqClustR.
     https://github.com/aditya9352/seqClustR.
[15] S. Shen and M. Chi. Clustering student sequential
     trajectories using dynamic time warping. International
     Educational Data Mining Society, 2017.
[16] B. Shih, K. R. Koedinger, and R. Scheines. Discovery
     of student strategies using hidden markov model
     clustering. In the Proceedings of the 6th International
     Conference on Educational Data Mining. Citeseer,
     2010.
[17] N. Trcka, M. Pechenizkiy, and W. van der Aalst.
     Process mining from educational data. Handbook of
     educational data mining, pages 123–142, 2010.
[18] O. Zaiane. Web usage mining for a better web-based
     learning environment. 2001.
[19] M. Zhou, Y. Xu, J. C. Nesbit, and P. H. Winne.
     Sequential pattern analysis of learning logs:
     Methodology and applications. Handbook of
     educational data mining, 107:107–121, 2010.
Figure 3: Process Model for Cluster 1
Figure 4: Process Model for Cluster 2