seqClustR: An R Package for Sequence Clustering Aditya Sharma Playpower Labs Gujarat aditya@playpowerlabs.com ABSTRACT In this paper, we’re going to describe the core features of the R package seqClustR [14] dedicated to sequence clus- tering. Sequence clustering is a data mining technique that groups similar sequences into clusters based on their similar- ities. Sequence clustering is useful when there are unknown number of similar sequences that need to be identified to gain valuable insights. The main feature of this package is that it provides easy access to different algorithms such as Edit Distance with Hierarchical Clustering, Markov Model- Based Clustering, Dynamic Time Warping, and K-Means to perform sequence clustering. We find that different algo- rithms can create very different clusters and lead us to very different conclusions. So to get a reliable understanding of the sequences, we need to apply various sequence cluster- ing algorithms and explore the data from multiple points of view. This paper illustrates how you could create different clusters from different algorithms, extract event log data for each cluster and visualize them. We have provided an ex- ample in section 4 showing step by step procedure to run sequence clustering on the National Assessment of Educa- tional Progress (NAEP) dataset. Keywords Sequence Clustering, Sequence Data, Event Logs, Visualiza- tion 1. INTRODUCTION In this paper, we’ll present the seqClustR package, which implements different clustering algorithms on sequence data in R. Sequence data contains information about the differ- ent activities performed over time. Working with sequence data can be hard sometimes; but it can probably provide better insights by making use of the temporal dimension [1]. When sequence data contains a lot of activities which lead to different types of sequences, the process model on the complete data can become too complex to interpret. To get better insights from the sequence data, we can cluster Figure 1: Process model on complete data Vs clustered data Copyright ©2021 for this paper by its authors. Use permitted under Cre- ative Commons License Attribution 4.0 International (CC BY 4.0) similar sequences together and analyse them; sequence clus- This package is designed to cluster sequence data. The pack- tering can provide better explainable models than a general age uses event log [3] as an input for the data. Event log model fitted on complete data [2]. are very commonly used to store the user behavior data. They indicate the sequence of actions a user takes over time, In process mining, sequence clustering plays an important along with added metadata of the event. For process analy- role by grouping similar sequences and providing helpful in- sis, event log is an essential data format. Event log contain sights, especially when we have little knowledge about var- three major attributes: case, activity, and timestamp. A ious types of processes hidden in the data. For example, case can be defined as a sequence of activities performed a business process data might have different versions of a over time, and an event log represents one or more cases. single process within it. Sequence clustering is used in dif- ferent domains: in bioinformatics to group similar biologi- Once we have the data in event log format, we can perform cal sequences; in marketing to identify different purchasing sequence clustering by passing it into one of the sequence behaviors of customers; in education to identify different clustering functions. The output of the function would be a learner behaviors. In education, while using digital con- list containing the fitted model and a data frame having the tent learners exhibit different types of behaviors: gaming case to cluster assigned mapping. To run analysis on individ- the system behavior; off-task behavior; carelessness behav- ual clusters, we need event logs for each cluster for which we ior. Performing sequence clustering on educational sequence have written a function split_event_log which takes event data can help us group similar learners together and then we logs and the clusters assigned data frame as inputs, and can compare their performance to understand which learner returns a list of event log by cluster. To visually observe behavior models have a positive impact on learners’ growth. the differences between sequence clusters and comparing different clustering algorithms we’ve used the fuzzymineR package [11] as it is readily available in R. There are some 2. SEQUENCE CLUSTERING IN EDUCA- other tools available as well for visualizing sequence data, TION like ProM and Disco. In recent years, a lot of educational data has been col- lected through digital learning platforms at an increasing 3.1 Clustering Algorithms rate. With more learners joining the digital platforms and more types of content being created for them, we are observ- 3.1.1 Edit Distance Clustering ing a lot of clickstream data. Clickstream data shows us the (seq_edit_distance_clustering) Edit Distance between two sequence of activities that learners went through while be- sequences can be defined as the least number of activities re- ing on the platform. Gaining insights from such data could quired to be added, subtracted, or substituted to convert one help us identify different learner behaviors [16, 6]; improve sequence into another. For example, with 1 addition we can the platform design based on learner’s interactions on the convert the sequence (a, b, c) into another sequence (a, b, platform [18, 9]; and provide adaptive content to individual c, d), so the Edit Distance between them is 1. learners to better support their learning [7, 8]. To gain in- sights from sequence data a few techniques have been used For clustering the sequence data with the help of Edit Dis- in the education data mining community such as Associa- tance, we chose Hierarchical Clustering Algorithm. To make tion Rule Mining [5], Sequential Pattern Mining [19], Pro- it convenient to choose the number of clusters, we plot the cess Mining [17], Graph-Based Analysis [10], and Curricu- Hierarchical Clustering tree for the users so that they can lum Pacing [12]. make an informed decision about it. Understanding a learner behavior from process models using 3.1.2 Markov Model Based Clustering sequence data can be complex with high-dimensional data. (seq_markov_clustering) Markov Model-based clustering The sequence data can contain a high number of distinct is a probabilistic model-based approach to cluster sequence activities and all the activities need not be equally important data. The Markov Model-based clustering method is similar or have enough learner data; we could perform exploratory to K-Means, where the cluster centroids in Markov Model data analysis on the data to understand which activities are are Markov transition probability matrices and the data important. Activities that don’t provide much information points are Markov transition probability matrices for the can make the models more complex and harder to interpret sequences. An entry Aab in a Markov transition probability [4]. We should try to reduce the complexity of the sequence matrix A can be interpreted as the conditional probability data by preprocessing the data before doing any sequence that a learner will do activity b after doing the activity a, data analysis. To illustrate this by an example, Figure 1 independent of any previous activities. shows three different process maps of learner action sequence data from the NAEP assessment. The complex process map at the top is made from all of the sequences in the dataset 3.1.3 Dynamic Time Warping (seq_dtw_clustering) Dynamic Time Warping is a distance-based clustering al- (N = 1009), and the two process maps below it are made gorithm that measures similarity between two temporal se- from two distinct clusters of sequences (N1 = 726, N2 = quences, which can vary in speed. In the context of edu- 283). We can observe that the complexity of the process cational data, we can say that two learners that perform maps of clustered sequences is lower than the process map a sequence of activities in the same order but vary in the of complete data; it is much easier now to analyse the process number of times each activity was done would have 0 Dy- map and make hypotheses about learner behavior. namic Time Warping distance. For example, two learners with sequences (a, b, c) and (a, b, b, c, c) would have 0 Dy- 3. seqClustR FRAMEWORK namic Time Warping distance. Dynamic Time Warping can be used to analyze any data that contain a sequence of ac- tions over time, it has also been used to cluster educational sequence data previously [15]. We’ve used Dynamic Time Warping clustering algorithm from the R package Dynamic Time Warping [13] in our seqClustR package. 10000 20000 3.1.4 K-Means Clustering (seq_kmeans_clustering) K-Means Clustering Algorithm identifies K number of centroids and allocates each data Height point to the nearest cluster to minimize the within-cluster sum of squares. In our approach, we calculate the number of times each ac- tivity was done for each case and then normalize it by the se- 0 quence length. These features were then passed to K-Means Clustering Algorithm. 4. EXAMPLE In this section, we’ll explain how practitioners can use the package to perform sequence clustering on their data. We’ll show an example of how to prepare the sequence data that can be passed to the clustering algorithms and then how we can visualize the clusters. Figure 2: Hierarchical Cluster Plot 4.1 Data We’ve used NAEP’s Process Data for Math test from 2017 For clustering, we need to convert the sequence data into which included data of 2500 learners. NAEP test is used event log format. We have used STUDENTID as Case ID, by the US government to measure learner knowledge across Observable as Activity ID, and EventTime as Timestamp. the country. This data was part of the NAEP Data Mining For the missing attributes, we can manually add them [3]: Competition 2019 whose goal was to find effective and inef- Lifecycle ID was added as complete, Activity Instance ID fective learner test-taking behaviors. The test was divided was added as row number, and Resource was added as NA. into two blocks of 30 min where Block A’s data was sup- library ( tidyverse ) posed to be used to predict learner’s performance on Block l i b r a r y ( bupaR ) B. There were 8 different question types but for the analysis, we’re only working with data from Block A and consider- e v e n t l o g <− s e q u e n c e d a t a %>% ing Multiple-Choice Questions as it had the majority of the a r r a n g e ( EventTime ) %>% questions in the test. mutate ( l i f e c y c l e i d = ’ c o m p l e t e ’ , r e s o u r c e = NA, Attributes of row num = 1 : nrow ( . ) ) %>% Variable Description Event Logs eventlog ( case id = ”learnerID ” , Unique identifier of a c t i v i t y i d = ”O b s e r v a b l e ” , STUDENTID Case ID the learner a c t i v i t y i n s t a n c e i d = ”row num” , Block of the NAEP l i f e c y c l e id = ” l i f e c y c l e id ” , Block - test, A or B timestamp = ”EventTime ” , Accession Unique question r e s o u r c e i d = ” r e s o u r c e ”) - Number identifier ItemType Type of Question - Observable learner Activity Activity ID 4.2 Clustering Metadata of learner’s After converting the sequence data into event logs format, ExtendedInfo - we can start applying clustering algorithms on it. As an Activity EventTime Event Timestamp Timestamp example, we’ve performed Edit Distance Based Hierarchical Clustering on the event log data that we prepared in Section Table 1: Columns of the NAEP Process Data with Attributes 4.1. When we pass the event log into the respective function, of Event Logs we are shown a Hierarchical Cluster Tree (Figure 2) in the graphics window, and we get a user prompt to select whether we would like to cut the tree by number of groups or by Before applying any clustering algorithm to the data, we height. On selecting one of the options we need to enter need to perform data preprocessing steps to reduce data the value corresponding to it. The output of the function complexity. Our data preprocessing steps included remov- would be a list containing the fitted model and the cluster ing activities that didn’t have enough learner events and re- assignments by case id. moving events where data for any of the event log attributes mentioned in table 1 were missing. l i b r a r y ( seqClustR ) event logs we can run multiple clustering algorithms and c l u s t e r <− s e q e d i t d i s t a n c e c l u s t e r i n g ( compare them. In this paper, we’ve shown how to prepare event l o g ) sequence data, perform sequence clustering, and visualize the clusters. In future, we would like to add more cluster- ing algorithms and add a functionality in the package to do 4.3 Visualizing Clusters qualitative comparisons of clusters as well. Visualization is a powerful tool that can help us explore common behavioral patterns exhibited by learners. We’ve 6. ACKNOWLEDGMENTS got two clusters from the event log in Section 4.2, to find out We would like to thank the organizers of the 2019 NAEP how both of them differ from each other based on common Data Mining Challenge for providing the data. We hope learner behavior patterns we’ll make visualizations for these that the seqClustR package would support and inspire more clusters (Figures 3 and 4). We’ve made process models for research on sequence data. both the clusters using fuzzymineR package [11]. 7. REFERENCES # Get e v e n t l o g s by c l u s t e r as a l i s t . [1] H.-P. Blossfeld, T. Schneider, and J. Doll. Methodological advantages of panel studies. designing e v e n t l o g 2 <− s p l i t e v e n t l o g ( the new national educational panel study (neps) in eventlog , germany. Journal for Educational Research Online, c l u s t e r $ c l u s t e r assignment ) 1(1):10–32, 2009. [2] A. Bogarı́n, C. Romero, R. Cerezo, and l i b r a r y ( fuzzymineR ) M. Sánchez-Santillán. Clustering for improving educational process mining. In Proceedings of the # P r o c e s s Model f o r C l u s t e r 1 fourth international conference on learning analytics and knowledge, pages 11–15, 2014. m e t r i c s <− mine f u z z y model ( [3] bupaR. Creating Event Logs. e v e n t l o g 2 [ [ ”1 ” ] ] ) https://www.bupar.net/creating_eventlogs.html. [4] A. H. Cairns, B. Gueni, M. Fhima, A. Cairns, v i z f u z z y model ( m e t r i c s = m e t r i c s , S. David, and N. Khelifa. Process mining in the node s i g t h r e s h o l d = 0 . 1 , education domain. International Journal on Advances edge s i g t h r e s h o l d = 0 . 3 , in Intelligent Systems, 8(1):219–232, 2015. edge s i g t o c o r r r a t i o = 1 ) [5] E. Garcı́a, C. Romero, S. Ventura, C. de Castro, and T. Calders. Association rule mining in learning 4.4 Discussion management systems. Handbook of educational data Performing sequence clustering using hierarchical clustering mining, pages 93–106, 2010. with edit distance method on the NAEP data provided us [6] C. Hansen, C. Hansen, N. Hjuler, S. Alstrup, and two clusters that exhibit different behaviors. Cluster 1 (Fig- C. Lioma. Sequence modelling for analysing student ure 3) shows that for 56% of the times learners directly click interaction with educational systems. arXiv preprint on one of the choices after they have presented a multiple arXiv:1708.04164, 2017. choice question; 32% of the times when learners are pre- [7] M. Köck and A. Paramythis. Activity sequence sented with a multiple-choice question they first use the cal- modelling and dynamic clustering for personalized culator and then click on one of the choices. The different e-learning. User Modeling and User-Adapted behaviors exhibited by learners in cluster 1 could be be- Interaction, 21(1):51–97, 2011. cause of varying item difficulties, where an easy item might [8] J. D. Lomas, J. Forlizzi, N. Poonwala, N. Patel, not require the use of the calculator but for more difficult S. Shodhan, K. Patel, K. Koedinger, and E. Brunskill. item learners might need it. Interface design optimization as a multi-armed bandit problem. In Proceedings of the 2016 CHI conference In cluster 2 (Figure 4), learners show a different behavioral on human factors in computing systems, pages pattern, they tend to go through different steps before click- 4142–4153, 2016. ing on one of the choices; 39% of the times when learners [9] R. Luckin et al. Modeling learning patterns of students are presented with a multiple-choice question they first use with a tutoring system using hidden markov models. the calculator but unlike cluster 1, they do some scratch- Artificial intelligence in education: Building technology work before clicking on one of the choices; learners tend to rich learning contexts that work, 158:238, 2007. click on different choices a significant number of times be- [10] N. Patel, C. Sellman, and D. Lomas. Mining frequent fore submitting their final answer and moving on to the next learning pathways from a large educational dataset. item; learners also use the eliminate answer choice tool sig- arXiv preprint arXiv:1705.11125, 2017. nificantly in cluster 2. [11] N. Patel and T. Shah. fuzzymineR. https://github.com/nirmalpatel/fuzzymineR. 5. CONCLUSION [12] N. Patel, A. Sharma, C. Sellman, and D. Lomas. The R package seqClustR provides the means to perform Curriculum pacing: A new approach to discover different clustering algorithms on sequence data by reduc- instructional practices in classrooms. In International ing the complexity of preparing data for each algorithm in Conference on Intelligent Tutoring Systems, pages a different way, by just converting the sequence data into 345–351. Springer, 2018. [13] A. Sarda-Espinosa. dtwclust: Time Series Clustering Along with Optimization for the Dynamic Time Warping Distance, 2019. R package version 5.5.6. [14] A. Sharma and N. Patel. seqClustR. https://github.com/aditya9352/seqClustR. [15] S. Shen and M. Chi. Clustering student sequential trajectories using dynamic time warping. International Educational Data Mining Society, 2017. [16] B. Shih, K. R. Koedinger, and R. Scheines. Discovery of student strategies using hidden markov model clustering. In the Proceedings of the 6th International Conference on Educational Data Mining. Citeseer, 2010. [17] N. Trcka, M. Pechenizkiy, and W. van der Aalst. Process mining from educational data. Handbook of educational data mining, pages 123–142, 2010. [18] O. Zaiane. Web usage mining for a better web-based learning environment. 2001. [19] M. Zhou, Y. Xu, J. C. Nesbit, and P. H. Winne. Sequential pattern analysis of learning logs: Methodology and applications. Handbook of educational data mining, 107:107–121, 2010. Figure 3: Process Model for Cluster 1 Figure 4: Process Model for Cluster 2