=Paper= {{Paper |id=Vol-3051/PA_3 |storemode=property |title=seqClustR: An R Package for Sequence Clustering (Short Paper) |pdfUrl=https://ceur-ws.org/Vol-3051/PA_3.pdf |volume=Vol-3051 |authors=Aditya Sharma |dblpUrl=https://dblp.org/rec/conf/edm/Sharma21 }} ==seqClustR: An R Package for Sequence Clustering (Short Paper)== https://ceur-ws.org/Vol-3051/PA_3.pdf

seqClustR: An R Package for Sequence Clustering

Aditya Sharma
Playpower Labs
Gujarat
aditya@playpowerlabs.com

ABSTRACT
In this paper, we’re going to describe the core features of
the R package seqClustR [14] dedicated to sequence clus-
tering. Sequence clustering is a data mining technique that
groups similar sequences into clusters based on their similar-
ities. Sequence clustering is useful when there are unknown
number of similar sequences that need to be identified to
gain valuable insights. The main feature of this package is
that it provides easy access to different algorithms such as
Edit Distance with Hierarchical Clustering, Markov Model-
Based Clustering, Dynamic Time Warping, and K-Means to
perform sequence clustering. We find that different algo-
rithms can create very different clusters and lead us to very
different conclusions. So to get a reliable understanding of
the sequences, we need to apply various sequence cluster-
ing algorithms and explore the data from multiple points of
view. This paper illustrates how you could create different
clusters from different algorithms, extract event log data for
each cluster and visualize them. We have provided an ex-
ample in section 4 showing step by step procedure to run
sequence clustering on the National Assessment of Educa-
tional Progress (NAEP) dataset.

Keywords
Sequence Clustering, Sequence Data, Event Logs, Visualiza-
tion

1. INTRODUCTION
In this paper, we’ll present the seqClustR package, which
implements different clustering algorithms on sequence data
in R. Sequence data contains information about the differ-
ent activities performed over time. Working with sequence
data can be hard sometimes; but it can probably provide
better insights by making use of the temporal dimension
[1]. When sequence data contains a lot of activities which
lead to different types of sequences, the process model on
the complete data can become too complex to interpret. To
get better insights from the sequence data, we can cluster

Figure 1: Process model on complete data Vs clustered data

Copyright ©2021 for this paper by its authors. Use permitted under Cre-
ative Commons License Attribution 4.0 International (CC BY 4.0)
similar sequences together and analyse them; sequence clus- This package is designed to cluster sequence data. The pack-
tering can provide better explainable models than a general age uses event log [3] as an input for the data. Event log
model fitted on complete data [2]. are very commonly used to store the user behavior data.
They indicate the sequence of actions a user takes over time,
In process mining, sequence clustering plays an important along with added metadata of the event. For process analy-
role by grouping similar sequences and providing helpful in- sis, event log is an essential data format. Event log contain
sights, especially when we have little knowledge about var- three major attributes: case, activity, and timestamp. A
ious types of processes hidden in the data. For example, case can be defined as a sequence of activities performed
a business process data might have different versions of a over time, and an event log represents one or more cases.
single process within it. Sequence clustering is used in dif-
ferent domains: in bioinformatics to group similar biologi- Once we have the data in event log format, we can perform
cal sequences; in marketing to identify different purchasing sequence clustering by passing it into one of the sequence
behaviors of customers; in education to identify different clustering functions. The output of the function would be a
learner behaviors. In education, while using digital con- list containing the fitted model and a data frame having the
tent learners exhibit different types of behaviors: gaming case to cluster assigned mapping. To run analysis on individ-
the system behavior; off-task behavior; carelessness behav- ual clusters, we need event logs for each cluster for which we
ior. Performing sequence clustering on educational sequence have written a function split_event_log which takes event
data can help us group similar learners together and then we logs and the clusters assigned data frame as inputs, and
can compare their performance to understand which learner returns a list of event log by cluster. To visually observe
behavior models have a positive impact on learners’ growth. the differences between sequence clusters and comparing
different clustering algorithms we’ve used the fuzzymineR
package [11] as it is readily available in R. There are some
2. SEQUENCE CLUSTERING IN EDUCA- other tools available as well for visualizing sequence data,
TION like ProM and Disco.
In recent years, a lot of educational data has been col-
lected through digital learning platforms at an increasing 3.1 Clustering Algorithms
rate. With more learners joining the digital platforms and
more types of content being created for them, we are observ- 3.1.1 Edit Distance Clustering
ing a lot of clickstream data. Clickstream data shows us the (seq_edit_distance_clustering) Edit Distance between two
sequence of activities that learners went through while be- sequences can be defined as the least number of activities re-
ing on the platform. Gaining insights from such data could quired to be added, subtracted, or substituted to convert one
help us identify different learner behaviors [16, 6]; improve sequence into another. For example, with 1 addition we can
the platform design based on learner’s interactions on the convert the sequence (a, b, c) into another sequence (a, b,
platform [18, 9]; and provide adaptive content to individual c, d), so the Edit Distance between them is 1.
learners to better support their learning [7, 8]. To gain in-
sights from sequence data a few techniques have been used For clustering the sequence data with the help of Edit Dis-
in the education data mining community such as Associa- tance, we chose Hierarchical Clustering Algorithm. To make
tion Rule Mining [5], Sequential Pattern Mining [19], Pro- it convenient to choose the number of clusters, we plot the
cess Mining [17], Graph-Based Analysis [10], and Curricu- Hierarchical Clustering tree for the users so that they can
lum Pacing [12]. make an informed decision about it.

Understanding a learner behavior from process models using 3.1.2 Markov Model Based Clustering
sequence data can be complex with high-dimensional data. (seq_markov_clustering) Markov Model-based clustering
The sequence data can contain a high number of distinct is a probabilistic model-based approach to cluster sequence
activities and all the activities need not be equally important data. The Markov Model-based clustering method is similar
or have enough learner data; we could perform exploratory to K-Means, where the cluster centroids in Markov Model
data analysis on the data to understand which activities are are Markov transition probability matrices and the data
important. Activities that don’t provide much information points are Markov transition probability matrices for the
can make the models more complex and harder to interpret sequences. An entry Aab in a Markov transition probability
[4]. We should try to reduce the complexity of the sequence matrix A can be interpreted as the conditional probability
data by preprocessing the data before doing any sequence that a learner will do activity b after doing the activity a,
data analysis. To illustrate this by an example, Figure 1 independent of any previous activities.
shows three different process maps of learner action sequence
data from the NAEP assessment. The complex process map
at the top is made from all of the sequences in the dataset
3.1.3 Dynamic Time Warping (seq_dtw_clustering)
Dynamic Time Warping is a distance-based clustering al-
(N = 1009), and the two process maps below it are made
gorithm that measures similarity between two temporal se-
from two distinct clusters of sequences (N1 = 726, N2 =
quences, which can vary in speed. In the context of edu-
283). We can observe that the complexity of the process
cational data, we can say that two learners that perform
maps of clustered sequences is lower than the process map
a sequence of activities in the same order but vary in the
of complete data; it is much easier now to analyse the process
number of times each activity was done would have 0 Dy-
map and make hypotheses about learner behavior.
namic Time Warping distance. For example, two learners
with sequences (a, b, c) and (a, b, b, c, c) would have 0 Dy-
3. seqClustR FRAMEWORK namic Time Warping distance. Dynamic Time Warping can
be used to analyze any data that contain a sequence of ac-
tions over time, it has also been used to cluster educational
sequence data previously [15]. We’ve used Dynamic Time
Warping clustering algorithm from the R package Dynamic
Time Warping [13] in our seqClustR package.

10000 20000
3.1.4 K-Means Clustering
(seq_kmeans_clustering) K-Means Clustering Algorithm
identifies K number of centroids and allocates each data

Height
point to the nearest cluster to minimize the within-cluster
sum of squares.

In our approach, we calculate the number of times each ac-
tivity was done for each case and then normalize it by the se-

0
quence length. These features were then passed to K-Means
Clustering Algorithm.

4. EXAMPLE
In this section, we’ll explain how practitioners can use the
package to perform sequence clustering on their data. We’ll
show an example of how to prepare the sequence data that
can be passed to the clustering algorithms and then how we
can visualize the clusters. Figure 2: Hierarchical Cluster Plot

4.1 Data
We’ve used NAEP’s Process Data for Math test from 2017 For clustering, we need to convert the sequence data into
which included data of 2500 learners. NAEP test is used event log format. We have used STUDENTID as Case ID,
by the US government to measure learner knowledge across Observable as Activity ID, and EventTime as Timestamp.
the country. This data was part of the NAEP Data Mining For the missing attributes, we can manually add them [3]:
Competition 2019 whose goal was to find effective and inef- Lifecycle ID was added as complete, Activity Instance ID
fective learner test-taking behaviors. The test was divided was added as row number, and Resource was added as NA.
into two blocks of 30 min where Block A’s data was sup-
library ( tidyverse )
posed to be used to predict learner’s performance on Block
l i b r a r y ( bupaR )
B. There were 8 different question types but for the analysis,
we’re only working with data from Block A and consider-
e v e n t l o g <− s e q u e n c e d a t a %>%
ing Multiple-Choice Questions as it had the majority of the
a r r a n g e ( EventTime ) %>%
questions in the test.
mutate ( l i f e c y c l e i d = ’ c o m p l e t e ’ ,
r e s o u r c e = NA,
Attributes of row num = 1 : nrow ( . ) ) %>%
Variable Description
Event Logs eventlog ( case id = ”learnerID ” ,
Unique identifier of a c t i v i t y i d = ”O b s e r v a b l e ” ,
STUDENTID Case ID
the learner a c t i v i t y i n s t a n c e i d = ”row num” ,
Block of the NAEP l i f e c y c l e id = ” l i f e c y c l e id ” ,
Block -
test, A or B timestamp = ”EventTime ” ,
Accession Unique question r e s o u r c e i d = ” r e s o u r c e ”)
-
Number identifier
ItemType Type of Question -
Observable learner Activity Activity ID 4.2 Clustering
Metadata of learner’s After converting the sequence data into event logs format,
ExtendedInfo - we can start applying clustering algorithms on it. As an
Activity
EventTime Event Timestamp Timestamp example, we’ve performed Edit Distance Based Hierarchical
Clustering on the event log data that we prepared in Section
Table 1: Columns of the NAEP Process Data with Attributes 4.1. When we pass the event log into the respective function,
of Event Logs we are shown a Hierarchical Cluster Tree (Figure 2) in the
graphics window, and we get a user prompt to select whether
we would like to cut the tree by number of groups or by
Before applying any clustering algorithm to the data, we
height. On selecting one of the options we need to enter
need to perform data preprocessing steps to reduce data
the value corresponding to it. The output of the function
complexity. Our data preprocessing steps included remov-
would be a list containing the fitted model and the cluster
ing activities that didn’t have enough learner events and re-
assignments by case id.
moving events where data for any of the event log attributes
mentioned in table 1 were missing. l i b r a r y ( seqClustR )
event logs we can run multiple clustering algorithms and
c l u s t e r <− s e q e d i t d i s t a n c e c l u s t e r i n g ( compare them. In this paper, we’ve shown how to prepare
event l o g ) sequence data, perform sequence clustering, and visualize
the clusters. In future, we would like to add more cluster-
ing algorithms and add a functionality in the package to do
4.3 Visualizing Clusters qualitative comparisons of clusters as well.
Visualization is a powerful tool that can help us explore
common behavioral patterns exhibited by learners. We’ve 6. ACKNOWLEDGMENTS
got two clusters from the event log in Section 4.2, to find out We would like to thank the organizers of the 2019 NAEP
how both of them differ from each other based on common Data Mining Challenge for providing the data. We hope
learner behavior patterns we’ll make visualizations for these that the seqClustR package would support and inspire more
clusters (Figures 3 and 4). We’ve made process models for research on sequence data.
both the clusters using fuzzymineR package [11].
7. REFERENCES
# Get e v e n t l o g s by c l u s t e r as a l i s t . [1] H.-P. Blossfeld, T. Schneider, and J. Doll.
Methodological advantages of panel studies. designing
e v e n t l o g 2 <− s p l i t e v e n t l o g ( the new national educational panel study (neps) in
eventlog , germany. Journal for Educational Research Online,
c l u s t e r $ c l u s t e r assignment ) 1(1):10–32, 2009.
[2] A. Bogarı́n, C. Romero, R. Cerezo, and
l i b r a r y ( fuzzymineR ) M. Sánchez-Santillán. Clustering for improving
educational process mining. In Proceedings of the
# P r o c e s s Model f o r C l u s t e r 1 fourth international conference on learning analytics
and knowledge, pages 11–15, 2014.
m e t r i c s <− mine f u z z y model ( [3] bupaR. Creating Event Logs.
e v e n t l o g 2 [ [ ”1 ” ] ] ) https://www.bupar.net/creating_eventlogs.html.
[4] A. H. Cairns, B. Gueni, M. Fhima, A. Cairns,
v i z f u z z y model ( m e t r i c s = m e t r i c s , S. David, and N. Khelifa. Process mining in the
node s i g t h r e s h o l d = 0 . 1 , education domain. International Journal on Advances
edge s i g t h r e s h o l d = 0 . 3 , in Intelligent Systems, 8(1):219–232, 2015.
edge s i g t o c o r r r a t i o = 1 )
[5] E. Garcı́a, C. Romero, S. Ventura, C. de Castro, and
T. Calders. Association rule mining in learning
4.4 Discussion management systems. Handbook of educational data
Performing sequence clustering using hierarchical clustering mining, pages 93–106, 2010.
with edit distance method on the NAEP data provided us [6] C. Hansen, C. Hansen, N. Hjuler, S. Alstrup, and
two clusters that exhibit different behaviors. Cluster 1 (Fig- C. Lioma. Sequence modelling for analysing student
ure 3) shows that for 56% of the times learners directly click interaction with educational systems. arXiv preprint
on one of the choices after they have presented a multiple arXiv:1708.04164, 2017.
choice question; 32% of the times when learners are pre- [7] M. Köck and A. Paramythis. Activity sequence
sented with a multiple-choice question they first use the cal- modelling and dynamic clustering for personalized
culator and then click on one of the choices. The different e-learning. User Modeling and User-Adapted
behaviors exhibited by learners in cluster 1 could be be- Interaction, 21(1):51–97, 2011.
cause of varying item difficulties, where an easy item might [8] J. D. Lomas, J. Forlizzi, N. Poonwala, N. Patel,
not require the use of the calculator but for more difficult S. Shodhan, K. Patel, K. Koedinger, and E. Brunskill.
item learners might need it. Interface design optimization as a multi-armed bandit
problem. In Proceedings of the 2016 CHI conference
In cluster 2 (Figure 4), learners show a different behavioral on human factors in computing systems, pages
pattern, they tend to go through different steps before click- 4142–4153, 2016.
ing on one of the choices; 39% of the times when learners [9] R. Luckin et al. Modeling learning patterns of students
are presented with a multiple-choice question they first use with a tutoring system using hidden markov models.
the calculator but unlike cluster 1, they do some scratch- Artificial intelligence in education: Building technology
work before clicking on one of the choices; learners tend to rich learning contexts that work, 158:238, 2007.
click on different choices a significant number of times be- [10] N. Patel, C. Sellman, and D. Lomas. Mining frequent
fore submitting their final answer and moving on to the next learning pathways from a large educational dataset.
item; learners also use the eliminate answer choice tool sig- arXiv preprint arXiv:1705.11125, 2017.
nificantly in cluster 2. [11] N. Patel and T. Shah. fuzzymineR.
https://github.com/nirmalpatel/fuzzymineR.
5. CONCLUSION [12] N. Patel, A. Sharma, C. Sellman, and D. Lomas.
The R package seqClustR provides the means to perform Curriculum pacing: A new approach to discover
different clustering algorithms on sequence data by reduc- instructional practices in classrooms. In International
ing the complexity of preparing data for each algorithm in Conference on Intelligent Tutoring Systems, pages
a different way, by just converting the sequence data into 345–351. Springer, 2018.
[13] A. Sarda-Espinosa. dtwclust: Time Series Clustering
Along with Optimization for the Dynamic Time
Warping Distance, 2019. R package version 5.5.6.
[14] A. Sharma and N. Patel. seqClustR.
https://github.com/aditya9352/seqClustR.
[15] S. Shen and M. Chi. Clustering student sequential
trajectories using dynamic time warping. International
Educational Data Mining Society, 2017.
[16] B. Shih, K. R. Koedinger, and R. Scheines. Discovery
of student strategies using hidden markov model
clustering. In the Proceedings of the 6th International
Conference on Educational Data Mining. Citeseer,
2010.
[17] N. Trcka, M. Pechenizkiy, and W. van der Aalst.
Process mining from educational data. Handbook of
educational data mining, pages 123–142, 2010.
[18] O. Zaiane. Web usage mining for a better web-based
learning environment. 2001.
[19] M. Zhou, Y. Xu, J. C. Nesbit, and P. H. Winne.
Sequential pattern analysis of learning logs:
Methodology and applications. Handbook of
educational data mining, 107:107–121, 2010.
Figure 3: Process Model for Cluster 1
Figure 4: Process Model for Cluster 2