A Preliminary Study on Clustering Student Learning Data

Haiyun Bian
hbian@mscd.edu
Department of Mathematical & Computer Sciences
Metropolitan State College of Denver

Abstract

Clustering techniques have been used on educational data to find groups of students who demonstrate similar learning patterns. Many educational datasets are relatively small in the sense that they contain fewer than a thousand student records. At the same time, each student may participate in dozens of activities, which means that these datasets are high dimensional. Finding meaningful clusters in such datasets challenges traditional clustering algorithms. In this paper, we show a variety of ways to cluster student grade sheets using various clustering and subspace clustering algorithms. Our preliminary results suggest that each algorithm has its own strengths and weaknesses and can be used to find clusters with different properties. We also show that subspace clustering is well suited to identifying meaningful patterns embedded in such datasets.

Introduction

Traditional data mining projects deal with datasets containing thousands or millions of records. Educational data mining tends to deal with much smaller datasets, normally in the range of several hundred student records. Even though a course may be offered multiple times, it is difficult to merge the resulting records because every offering of the same course may involve a different set of activities. On the other hand, many educational datasets are by nature high dimensional. For example, students' learning in a course may be assessed through an aggregation of several assignments, quizzes and tests. It is typical to have more than a dozen activities that contribute to a student's final grade, and the log data from online courses contain even more features describing the activities in which each individual student participates.

Clustering is a very useful technique for finding groups of students who demonstrate similar performance patterns. Since the number of students in each dataset rarely goes beyond a thousand and the number of features tends to be comparable to the number of students, finding coherent and compact clusters becomes difficult for this type of data. It is difficult because the pair-wise distances between students in the full-dimensional space become nearly indistinguishable as the number of features grows. This problem is known as the curse of dimensionality, and it makes traditional clustering algorithms, such as k-means, unsuitable for direct application to high dimensional datasets.

Subspace clustering was proposed as a solution to this problem (Agrawal et al. 1998). Subspace clustering algorithms search for compact clusters embedded within subsets of features, and they have shown their effectiveness in domains whose datasets are similarly high dimensional. One specific example is microarray data analysis. Microarray datasets tend to be of similar size to educational datasets, mostly in the range of several hundred instances (genes or students) and several hundred features (samples or activities). Subspace clustering algorithms find subsets of genes that show similar expression levels under subsets of samples (Cheng et al. 2000; Madeira et al. 2004).

In this paper, we present some preliminary results from applying a variety of clustering techniques, including subspace clustering, to student grade sheets. We show that clustering this type of dataset can give the instructor a tool to predict, at a very early stage, which students are likely to fail the course, as well as a possible explanation of why they are failing.
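The loss of distance contrast behind the curse of dimensionality is easy to reproduce with synthetic data. The following sketch is illustrative only: the 300 records and random scores are made up rather than drawn from any course. It measures how small the gap between the closest and the farthest pair of points becomes, relative to the closest pair, as the number of features grows:

    # Sketch: pair-wise distances lose contrast as dimensionality grows.
    # Synthetic data only; the sizes and scores are illustrative.
    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)

    for n_features in (2, 16, 128, 1024):
        # 300 hypothetical "students" with scores in [0, 1] on n_features "activities"
        scores = rng.random((300, n_features))
        dists = pdist(scores)                     # all pair-wise Euclidean distances
        contrast = (dists.max() - dists.min()) / dists.min()
        print(f"{n_features:4d} features: (max - min) / min distance = {contrast:.2f}")

The printed contrast shrinks steadily as the number of features grows, which is exactly the regime in which full-space clustering of a few hundred students over dozens of activities becomes unreliable.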
Related Research

Over the last decade, many data mining techniques have been applied to educational data (Bravo et al. 2009; Dekker et al. 2009; Merceron et al. 2009). Research has shown that some techniques are more suitable for educational data than others, mainly because of the inherent characteristics of the datasets in this domain. For example, support and confidence, the two commonly used interestingness measurements for association rules, are not suitable for pruning association rules when applied to educational data (Merceron et al. 2009). Instead, the authors found that cosine and added value (or, equivalently, lift) are better measurements for educational data. One possible reason is that educational data have a much smaller number of instances than market basket data, so support and confidence tend to fall short in capturing the real value of a good association rule in the educational context.

Subspace clustering was first introduced to cluster students' skill sets in (Nugent et al. 2009). The authors proposed to start with a "value-hunting" scan of each individual feature to find all features that contain meaningful and well-separated single-dimensional clusters. Features that contain no good clusters were disregarded from further consideration. Then, using all remaining features, conventional clustering algorithms such as hierarchical clustering and k-means were applied to identify clusters residing in higher-dimensional spaces. In their research, subspace analysis is used to prune away uninteresting features before the actual clustering starts, so it is very similar to a feature selection procedure.

In general, a subspace cluster is a group of instances that are similar within their own subset of features. After the first subspace clustering algorithm for data mining was proposed (Agrawal et al. 1998), many different algorithms have followed. These algorithms can be classified into two categories: partition-based approaches (Aggarwal et al. 1999; Aggarwal et al. 2000) and grid-based (or density-based) approaches (Agrawal et al. 1998; Cheng et al. 2000; Kriegel et al. 2009).

Partition-based algorithms partition all instances into mutually exclusive groups. Each group, together with the subset of features on which the group shows the greatest similarity, is reported as a subspace cluster. Similar to k-means, most algorithms in this category define an objective function to guide the search. The major difference from the k-means algorithm is that the objective functions of subspace clustering algorithms depend on the subspaces in which each cluster resides. Notice that in subspace clustering, the search is not only over partitions of the instance set, but also over subspaces for each instance group. For example, PROCLUS (Aggarwal et al. 1999) is a variation of the k-medoid algorithm. In PROCLUS, the number of clusters k and the average number of dimensions per cluster are specified before the algorithm runs. The algorithm also assumes that an instance can be assigned to at most one subspace cluster or classified as an outlier, while a feature can belong to multiple clusters. Unlike PROCLUS, which finds only axis-parallel subspace clusters, ORCLUS (Aggarwal et al. 2000) finds clusters in arbitrarily oriented subspaces.

Grid-based (density-based) algorithms treat the data matrix as a high dimensional grid, and the clustering process is a search for dense regions in the grid. In CLIQUE (Agrawal et al. 1998), each dimension is partitioned into intervals of equal length, and an n-dimensional unit is the intersection of intervals from n distinct dimensions. An instance is contained in a unit if the values of all its features fall in the unit's intervals for all of the unit's dimensions. A unit is dense if the fraction of the total instances contained in it exceeds an input parameter δ. CLIQUE starts the search for dense units from single dimensions. Candidate n-dimensional dense units are generated using the downward closure property: if a unit is dense in k dimensions, all of its (k-1)-dimensional projection units must also be dense. This property dramatically reduces the search space. Since the number of candidate dense units still grows exponentially in the highest dimensionality of the dense units, the algorithm becomes very inefficient when there are clusters in subspaces of high dimensionality. Research has been done to extend CLIQUE by using adaptive units instead of rigid grids (Kriegel et al. 2009), as well as by using other parameters, such as entropy in addition to density, to prune away uninteresting subspaces (Cheng et al. 2000).
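To make the grid-based search concrete, the following sketch illustrates the first two levels of a CLIQUE-style search. It is a simplified illustration of the idea rather than the CLIQUE implementation, and the grid resolution and density threshold are arbitrary choices:

    # Sketch of the first two levels of a CLIQUE-style search (illustrative only).
    # Assumes scores lie in [0, 1]; the grid resolution and delta are arbitrary.
    from itertools import combinations
    import numpy as np

    def dense_units_1d(scores, n_intervals=5, delta=0.2):
        """Return dense 1-D units as {(feature, interval): set of row indices}."""
        n, m = scores.shape
        units = {}
        for f in range(m):
            bins = np.minimum((scores[:, f] * n_intervals).astype(int), n_intervals - 1)
            for interval in range(n_intervals):
                members = set(np.where(bins == interval)[0])
                if len(members) / n > delta:          # the unit is dense
                    units[(f, interval)] = members
        return units

    def dense_units_2d(units_1d, n, delta=0.2):
        """Join dense 1-D units on distinct features. By downward closure, every
        dense 2-D unit must project onto two dense 1-D units, so nothing is missed."""
        units = {}
        for (u1, m1), (u2, m2) in combinations(units_1d.items(), 2):
            if u1[0] != u2[0]:                        # units must use distinct features
                members = m1 & m2
                if len(members) / n > delta:
                    units[(u1, u2)] = members
        return units

    scores = np.random.default_rng(1).random((30, 16))  # stand-in for a grade sheet
    one_d = dense_units_1d(scores)
    two_d = dense_units_2d(one_d, len(scores))
    print(len(one_d), "dense 1-D units,", len(two_d), "dense 2-D units")

Higher-dimensional candidate units are built in the same way from dense lower-dimensional ones, which is also where the exponential growth mentioned above comes from.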
Clustering Student Grade Sheets

We assume that datasets are in the following format: each row represents one student record, and each column measures one activity that students participate in during the course. An example is shown in Table 1, where dij denotes the ith student's performance score on the jth activity. Most clustering and subspace clustering algorithms allow dij to take real values.

          Activity 1   ......   Activity m
Stu 1     d11          ......   d1m
Stu 2     d21          ......   d2m
......    ......       ......   ......
Stu n     dn1          ......   dnm

Table 1. Dataset Format

Figure 1 shows three different kinds of clusters and subspace clusters that can be identified from such data using different clustering and subspace clustering algorithms. The properties of each type of cluster, as well as the process used to find it, are presented in the following subsections.

[Figure 1. Clusters on Student Activity Data]

The example dataset

We will use the grade sheet of a computer science service course as the example for this study. The dataset contains 30 students and 16 activities plus the final grade. The score of each activity, as well as the final grade, lies in the range [0, 1]. All students whose final composite grade is below 0.6 (60%) are marked as failing the course; by this standard, 7 of the 30 students are marked as failing. Activities 1 through 12 are weekly in-class labs in chronological order, activities 13 and 14 are two large projects due at mid-semester and at the end of the semester, and activities 15 and 16 are the mid-term and final examinations.

Each individual feature shows positive covariance with the final grade variable, as listed below. Several features are highly correlated with the final grade, such as activities 6, 8, 9, 10 and 11. The least predictive features include activities 1, 13 and 15. It is not surprising that the first lab (activity 1) is not a good indicator of a student's performance in the course, but it is interesting that the mid-term exam (activity 15) and the mid-term project (activity 13) are just as poor as the first lab at telling whether a student will pass the course.

Feature       Covariance     Feature       Covariance
Activity 1    .5044          Activity 9    .9075
Activity 2    .7758          Activity 10   .9089
Activity 3    .6097          Activity 11   .9067
Activity 4    .7415          Activity 12   .8137
Activity 5    .7125          Activity 13   .5283
Activity 6    .9787          Activity 14   .8427
Activity 7    .8670          Activity 15   .5613
Activity 8    .9065          Activity 16   .8046

Another interesting observation is that activity 6 (lab 6) alone can predict with 100% accuracy whether a student will pass the course. We later found out that the topic covered that week was loops, which is considered challenging for most students. This suggests that a student who grasps the loop structure well is also likely to pass the course as a whole. Therefore, it would be worthwhile for the instructor to spend more time and effort on this subject.
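This kind of screening is straightforward to reproduce. A minimal sketch, assuming the grade sheet is stored in a CSV file named grades.csv with columns Activity1 through Activity16 plus Final (placeholder names, not the actual file):

    # Sketch: covariance (and correlation) of every activity with the final grade.
    # "grades.csv" and the column names are illustrative placeholders.
    import pandas as pd

    grades = pd.read_csv("grades.csv")        # 30 rows, Activity1..Activity16, Final
    activities = [c for c in grades.columns if c.startswith("Activity")]

    cov_with_final = grades[activities].apply(lambda col: col.cov(grades["Final"]))
    corr_with_final = grades[activities].corrwith(grades["Final"])
    print(cov_with_final.sort_values(ascending=False))
    print(corr_with_final.sort_values(ascending=False))

Sorting the resulting series makes the strong and weak predictors easy to spot.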
Student clusters

Here we focus on identifying groups of students who demonstrate similar performance throughout the whole course. This type of cluster can help the instructor identify key activities that differentiate successful students from those who fail the course.

We have tried a wide variety of clustering algorithms available in Weka (Weka URL) on the example dataset, and the results show that the simple k-means algorithm achieves results at least comparable to other, more complicated algorithms in almost all cases.

Using the simple k-means algorithm, we started with k=2, that is, finding two clusters (cluster0 and cluster1) in this dataset. Cluster0 contains 6 of the 7 students who actually failed the course, and cluster1 contains 24 students, 23 of whom are marked as passing the course. One failing student is clustered into cluster1; this student's composite final score is .58, which lies right on the boundary of the passing/failing threshold. This suggests that choosing 0.6 as the passing/failing threshold is rather arbitrary.

Figure 2 shows the centroids of the two clusters. We can see that some activities differentiate the two clusters better than others, such as activities 6, 8, 9, 10 and 11. This result is consistent with the individual features' covariance with the final grade variable, suggesting that the clusters identified by the algorithm may have captured some real characteristics of the dataset.

We have also tried k=3 to find three clusters in this dataset. It resulted in cutting the failing cluster (cluster0) into two even smaller clusters, leaving cluster1 unchanged.

Features      Full Data (30)   cluster0 (6)   cluster1 (24)
===========================================================
Activity1     0.76             0.475          0.8313
Activity2     0.7533           0.3208         0.8615
Activity3     0.7908           0.3333         0.9052
Activity4     0.785            0.3333         0.8979
Activity5     0.815            0.3792         0.924
Activity6     0.7767           0.0708         0.9531
Activity7     0.79             0.3083         0.9104
Activity8     0.7983           0              0.9979
Activity9     0.7683           0              0.9604
Activity10    0.7308           0              0.9135
Activity11    0.7833           0.1667         0.9375
Activity12    0.7667           0.1667         0.9167
Activity13    0.7647           0.5767         0.8117
Activity14    0.667            0              0.8338
Activity15    0.7693           0.7844         0.7656
Activity16    0.5674           0.2278         0.6523

Figure 2. Centroids of Student Clusters (K = 2)
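A minimal sketch of this step, using scikit-learn's KMeans in place of Weka's SimpleKMeans and the same placeholder file and column names as before:

    # Sketch: two-cluster k-means on the activity scores, then inspect the centroids
    # and the pass/fail composition of each cluster. Names are placeholders.
    import pandas as pd
    from sklearn.cluster import KMeans

    grades = pd.read_csv("grades.csv")
    activities = [c for c in grades.columns if c.startswith("Activity")]

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(grades[activities])
    grades["cluster"] = km.labels_

    print(pd.DataFrame(km.cluster_centers_, columns=activities).round(3))
    print(pd.crosstab(grades["cluster"], grades["Final"] < 0.6))

The first print corresponds to the centroid table in Figure 2; the cross-tabulation shows how the two clusters line up with the 0.6 pass/fail threshold.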
Activity clusters

In this section we take a different view of the same dataset. Here we focus on finding groups of activities on which all students demonstrate similar performance patterns. For example, we may find a group of activities on which all students demonstrate consistently high performance, which suggests that these activities involve relatively easy-to-grasp concepts. We may also find a group of activities on which all students show worse than average performance, which suggests that the instructor may want to spend more time on these activities to cope with the difficulty.

To find this type of cluster, we need to transpose the original data matrix of Table 1 into the form shown in Table 2.

               Student 1   ......   Student n
Activity 1     d11         ......   dn1
Activity 2     d12         ......   dn2
......         ......      ......   ......
Activity m     d1m         ......   dnm

Table 2. Transposed Dataset

As above, we applied SimpleKMeans from Weka to the transposed example dataset. We tested five values of k, from 2 to 6, and tried to find the best value by looking at the within-group variance as a function of k. The result is shown in Figure 3. Four clusters appear to be the best choice because the slope of the curve flattens significantly after k=4.

[Figure 3. Within Group Variances]

Among the four clusters of activities, cluster3 is the most challenging group because its centroid is consistently lower than those of the other three clusters. Cluster3 contains three activities: activity 13, activity 15 and activity 16. Out of 30 students, 13 showed significantly lower than average performance on these three activities. Interestingly, in the previous sections we pointed out that activities 13 and 15 are also insignificant in differentiating passing students from failing ones. This suggests that these two activities may be too hard to be used as criteria for predicting a student's overall performance in the course. On the other hand, activity 1, which is likewise a poor feature for telling passing students from those who failed, might be too easy to be used as a criterion for that purpose.
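Both the transposition and the elbow-style search for k take only a few lines. A minimal sketch, again with the placeholder grades.csv:

    # Sketch: cluster activities instead of students by transposing the matrix,
    # and pick k by looking at the within-group variance (k-means inertia) curve.
    import pandas as pd
    from sklearn.cluster import KMeans

    grades = pd.read_csv("grades.csv")
    activities = [c for c in grades.columns if c.startswith("Activity")]

    transposed = grades[activities].T          # rows are now activities, columns students

    for k in range(2, 7):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(transposed)
        print(k, "clusters -> within-group variance:", round(km.inertia_, 3))

Plotting the printed values against k reproduces the kind of curve shown in Figure 3, and the knee of that curve suggests how many activity clusters to keep.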
Subspace clusters

In this section, we show that subspace clustering algorithms can be used to find clusters embedded in subspaces. In the earlier sections we looked at student clusters, which contain groups of students who demonstrate similar performance throughout the whole course. Here we relax that constraint and allow any group of students who demonstrate similar learning patterns in any subset of activities to become a candidate cluster.

We first show the results from a partition-based subspace clustering algorithm. We chose PROCLUS because it reports clusters in axis-parallel subspaces, which makes the final interpretation of the clusters easier. The PROCLUS implementation is from the open source subspace clustering package (OpenCluster URL).

Similar to k-means, PROCLUS needs a pre-determined number of clusters (k) before running. In addition, it requires the average subspace dimensionality (l). We set k=4, tried several values of l between 2 and 5, and found that the results are highly similar in all cases. For the example dataset, PROCLUS finds the following four clusters when we set k = 4 and l = 3:

SC_0: [0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0] #13 {2 5 7 8 10 13 14 15 17 21 23 27 29}
SC_1: [0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0] #11 {0 1 4 12 16 19 20 24 25 26 28}
SC_2: [0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0] #4  {3 6 9 22}
SC_3: [0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0] #2  {11 18}

Each line describes one subspace cluster. For example, the first subspace cluster (SC_0) lies in a subspace that contains two features, activity 8 and activity 12, and contains 13 students: stu2, stu5, stu7, and so on. A simple investigation shows that SC_2 and SC_3 contain all failing students. In SC_2, 4 of the 6 students who fail the class have difficulty with activity 6 and activity 7, and SC_3 shows that the other two failing students had difficulty with activity 2 and activities 8, 9 and 10. We later found out that activity 6 was a lab on the loop structure, and labs 8 and 9 are labs on classes and objects. This suggests that the majority of the students who failed this course started to fall behind when loops were introduced, while the rest failed to catch up when the concepts of object-oriented programming were introduced. Therefore, the instructor may want to spend extra time helping students complete these three activities. We can also see that SC_1 and SC_3 are the two clusters best contrasted by activities 6, 7 and 8. Since all students in SC_1 passed the course while the SC_3 students failed, these three activities may be crucial for passing the course.

We have also tried grid-based subspace clustering algorithms on the sample data. Grid-based algorithms produce more than a thousand subspace clusters, and the large number of reported clusters makes their interpretation very difficult. We will look into the possibility of pruning insignificant clusters based on domain knowledge; similar research has been done in bio-medical data analysis, where domain knowledge is used to measure the significance of each bi-cluster.
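The bit-mask lines above are compact but easy to misread. A small helper such as the following (written for the line format shown above; it is not part of the OpenSubspace package) decodes each cluster into activity names and student indices:

    # Sketch: decode a subspace-cluster line of the form
    #   "SC_0: [0 0 ... 1 ...] #13 {2 5 7 ...}"
    # into the activities spanning the subspace and the member students.
    import re

    def decode(line):
        name, rest = line.split(":", 1)
        mask = [int(b) for b in re.search(r"\[([^\]]*)\]", rest).group(1).split()]
        members = [int(s) for s in re.search(r"\{([^}]*)\}", rest).group(1).split()]
        activities = [f"activity {i + 1}" for i, bit in enumerate(mask) if bit == 1]
        return name.strip(), activities, members

    line = "SC_0: [0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0] #13 {2 5 7 8 10 13 14 15 17 21 23 27 29}"
    print(decode(line))
    # ('SC_0', ['activity 8', 'activity 12'], [2, 5, 7, 8, 10, 13, 14, 15, 17, 21, 23, 27, 29])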
Comparisons between the three

Student clusters represent groups of students showing similar performance patterns throughout the whole course, while subspace clusters are groups of students who demonstrate similar performance on subsets of activities. Activity clusters are helpful for finding tasks that are difficult for all students, while subspace clusters can identify subsets of activities that challenge different groups of students. Since not all students experience the same difficulty with all activities, subspace clustering seems well suited for this purpose. We can see from the example data that activity 6 may be a good feature for explaining why some students failed the course, but it is not the only indicator: SC_3 suggests that some students had no problem finishing activity 6 but still failed the course due to their unsatisfactory performance on activities 8, 9 and 10.

Conclusions and Future Work

This paper is our first attempt to apply a rich collection of subspace clustering algorithms to educational data. Our preliminary results show that clustering and subspace clustering techniques can be used on high dimensional educational data to find interesting student learning patterns. These cluster patterns help the instructor gain insight into different learning behaviors and adapt the course to accommodate various students' needs. We will test and validate all of the presented clustering schemes on more educational data of larger size. We will also look into the possibility of applying grid-based subspace clustering algorithms to educational data guided by domain knowledge.

References

Aggarwal C. C., Wolf J. L., Yu P. S., Procopiuc C., and Park J. S. Fast Algorithms for Projected Clustering. Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD'99), pp. 61-72, 1999.

Aggarwal C. C., and Yu P. S. Finding Generalized Projected Clusters in High Dimensional Spaces. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD'00), pp. 70-81, 2000.

Agrawal R., Gehrke J., Gunopulos D., and Raghavan P. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD'98), pp. 94-105, 1998.

Bravo J., and Ortigosa A. Detecting Symptoms of Low Performance Using Production Rules. Proceedings of the Second International Conference on Educational Data Mining, pp. 31-40, 2009.

Cheng C. H., Fu A. W.-C., and Zhang Y. Entropy-based Subspace Clustering for Mining Numerical Data. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'99), pp. 84-93, 1999.

Cheng Y., and Church G. M. Biclustering of Expression Data. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB'00), pp. 93-103, 2000.

Dekker G. W., Pechenizkiy M., and Vleeshouwers J. M. Predicting Students Drop Out: A Case Study. Proceedings of the Second International Conference on Educational Data Mining, pp. 41-50, 2009.

Kriegel H.-P., Kröger P., and Zimek A. Clustering High-dimensional Data: A Survey on Subspace Clustering, Pattern-based Clustering, and Correlation Clustering. ACM Transactions on Knowledge Discovery from Data (TKDD), Vol. 3, 2009.

Madeira S., and Oliveira A. Biclustering Algorithms for Biological Data Analysis: A Survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 1, pp. 24-45, 2004.

Merceron A., and Yacef K. Interestingness Measures for Association Rules in Educational Data. Proceedings of the Second International Conference on Educational Data Mining, pp. 57-68, 2009.

Nugent R., Ayers E., and Dean N. Conditional Subspace Clustering of Skill Mastery: Identifying Skills that Separate Students. Proceedings of the Second International Conference on Educational Data Mining, pp. 101-110, 2009.

OpenClusters: http://dme.rwth-aachen.de/OpenSubspace/

Weka: http://www.cs.waikato.ac.nz/ml/weka