A Preliminary Study on Clustering Student Learning Data

Haiyun Bian
hbian@mscd.edu
Department of Mathematical & Computer Sciences
Metropolitan State College of Denver

Abstract

Clustering techniques have been used on educational data to find groups of students who demonstrate similar learning patterns. Many educational datasets are relatively small in the sense that they contain fewer than a thousand student records. At the same time, each student may participate in dozens of activities, which means that these datasets are high dimensional. Finding meaningful clusters in such datasets challenges traditional clustering algorithms. In this paper, we show a variety of ways to cluster student grade sheets using various clustering and subspace clustering algorithms. Our preliminary results suggest that each algorithm has its own strengths and weaknesses and can be used to find clusters with different properties. We also show that subspace clustering is well suited to identifying meaningful patterns embedded in such datasets.

Introduction

Traditional data mining projects deal with datasets containing thousands or millions of records. Educational data mining tends to deal with much smaller datasets, normally in the range of several hundred student records. Even though a course may be offered multiple times, it is difficult to merge the resulting records because every offering of the same course may involve a different set of activities. On the other hand, many educational datasets are by nature high dimensional. For example, students' learning in a course may be assessed through an aggregation of several assignments, quizzes and tests. It is typical to have more than a dozen activities that contribute to a student's final grade, and the log data from online courses contain even more features describing the activities in which each individual student participates.

Clustering is a very useful technique for finding groups of students who demonstrate similar performance patterns. Since the number of students in each dataset rarely goes beyond a thousand and the number of features tends to be comparable to the number of students, finding coherent and compact clusters becomes difficult for this type of data. It is difficult because the pair-wise distances between students in the full-dimensional space become nearly indistinguishable as the number of features grows. This problem is known as the curse of dimensionality, and it makes traditional clustering algorithms, such as k-means, unsuitable for direct application to high dimensional datasets.

Subspace clustering was proposed as a solution to this problem (Agrawal et al. 1998). Subspace clustering algorithms search for compact clusters embedded within subsets of features, and they have shown their effectiveness in domains whose datasets are similarly high dimensional. One specific example is microarray data analysis. Microarray datasets tend to be of similar size to educational datasets, mostly in the range of several hundred instances (genes or students) and several hundred features (samples or activities). Subspace clustering algorithms find subsets of genes that show similar expression levels under subsets of samples (Cheng et al. 2000; Madeira et al. 2004).

In this paper, we present some preliminary results from applying a variety of clustering techniques, including subspace clustering, to student grade sheets. We show that clustering this type of dataset can give the instructor a tool to predict, at a very early stage, which students are likely to fail the course, as well as a possible explanation of why they are failing.
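The loss of distance contrast behind the curse of dimensionality is easy to reproduce with synthetic data. The following sketch is illustrative only: the 300 records and random scores are made up rather than drawn from any course. It measures how small the gap between the closest and the farthest pair of points becomes, relative to the closest pair, as the number of features grows:

    # Sketch: pair-wise distances lose contrast as dimensionality grows.
    # Synthetic data only; the sizes and scores are illustrative.
    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)

    for n_features in (2, 16, 128, 1024):
        # 300 hypothetical "students" with scores in [0, 1] on n_features "activities"
        scores = rng.random((300, n_features))
        dists = pdist(scores)                     # all pair-wise Euclidean distances
        contrast = (dists.max() - dists.min()) / dists.min()
        print(f"{n_features:4d} features: (max - min) / min distance = {contrast:.2f}")

The printed contrast shrinks steadily as the number of features grows, which is exactly the regime in which full-space clustering of a few hundred students over dozens of activities becomes unreliable.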
Related Research

Over the last decade, many data mining techniques have been applied to educational data (Bravo et al. 2009; Dekker et al. 2009; Merceron et al. 2009). Research has shown that some techniques are more suitable for educational data than others, mainly because of the inherent characteristics of the datasets in this domain. For example, support and confidence, the two commonly used interestingness measurements for association rules, are not suitable for pruning association rules when applied to educational data (Merceron et al. 2009). Instead, the authors found that cosine and added value (or, equivalently, lift) are better measurements for educational data. One possible reason is that educational data have a much smaller number of instances than market basket data, so support and confidence tend to fall short in capturing the real value of a good association rule in the educational context.

Subspace clustering was first introduced to cluster students' skill sets in (Nugent et al. 2009). The authors proposed to start with a "value-hunting" scan of each individual feature to find all features that contain meaningful and well-separated single-dimensional clusters. Features that contain no good clusters were disregarded from further consideration. Then, using all remaining features, conventional clustering algorithms such as hierarchical clustering and k-means were applied to identify clusters residing in higher-dimensional spaces. In their research, subspace analysis is used to prune away uninteresting features before the actual clustering starts, so it is very similar to a feature selection procedure.

In general, a subspace cluster is a group of instances that are similar within their own subset of features. After the first subspace clustering algorithm for data mining was proposed (Agrawal et al. 1998), many different algorithms have followed. These algorithms can be classified into two categories: partition-based approaches (Aggarwal et al. 1999; Aggarwal et al. 2000) and grid-based (or density-based) approaches (Agrawal et al. 1998; Cheng et al. 2000; Kriegel et al. 2009).

Partition-based algorithms partition all instances into mutually exclusive groups. Each group, together with the subset of features on which the group shows the greatest similarity, is reported as a subspace cluster. Similar to k-means, most algorithms in this category define an objective function to guide the search. The major difference from the k-means algorithm is that the objective functions of subspace clustering algorithms depend on the subspaces in which each cluster resides. Notice that in subspace clustering, the search is not only over partitions of the instance set, but also over subspaces for each instance group. For example, PROCLUS (Aggarwal et al. 1999) is a variation of the k-medoid algorithm. In PROCLUS, the number of clusters k and the average number of dimensions per cluster are specified before the algorithm runs. The algorithm also assumes that an instance can be assigned to at most one subspace cluster or classified as an outlier, while a feature can belong to multiple clusters. Unlike PROCLUS, which finds only axis-parallel subspace clusters, ORCLUS (Aggarwal et al. 2000) finds clusters in arbitrarily oriented subspaces.

Grid-based (density-based) algorithms treat the data matrix as a high dimensional grid, and the clustering process is a search for dense regions in the grid. In CLIQUE (Agrawal et al. 1998), each dimension is partitioned into intervals of equal length, and an n-dimensional unit is the intersection of intervals from n distinct dimensions. An instance is contained in a unit if the values of all its features fall in the unit's intervals for all of the unit's dimensions. A unit is dense if the fraction of the total instances contained in it exceeds an input parameter δ. CLIQUE starts the search for dense units from single dimensions. Candidate n-dimensional dense units are generated using the downward closure property: if a unit is dense in k dimensions, all of its (k-1)-dimensional projection units must also be dense. This property dramatically reduces the search space. Since the number of candidate dense units still grows exponentially in the highest dimensionality of the dense units, the algorithm becomes very inefficient when there are clusters in subspaces of high dimensionality. Research has been done to extend CLIQUE by using adaptive units instead of rigid grids (Kriegel et al. 2009), as well as by using other parameters, such as entropy in addition to density, to prune away uninteresting subspaces (Cheng et al. 2000).
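To make the grid-based search concrete, the following sketch illustrates the first two levels of a CLIQUE-style search. It is a simplified illustration of the idea rather than the CLIQUE implementation, and the grid resolution and density threshold are arbitrary choices:

    # Sketch of the first two levels of a CLIQUE-style search (illustrative only).
    # Assumes scores lie in [0, 1]; the grid resolution and delta are arbitrary.
    from itertools import combinations
    import numpy as np

    def dense_units_1d(scores, n_intervals=5, delta=0.2):
        """Return dense 1-D units as {(feature, interval): set of row indices}."""
        n, m = scores.shape
        units = {}
        for f in range(m):
            bins = np.minimum((scores[:, f] * n_intervals).astype(int), n_intervals - 1)
            for interval in range(n_intervals):
                members = set(np.where(bins == interval)[0])
                if len(members) / n > delta:          # the unit is dense
                    units[(f, interval)] = members
        return units

    def dense_units_2d(units_1d, n, delta=0.2):
        """Join dense 1-D units on distinct features. By downward closure, every
        dense 2-D unit must project onto two dense 1-D units, so nothing is missed."""
        units = {}
        for (u1, m1), (u2, m2) in combinations(units_1d.items(), 2):
            if u1[0] != u2[0]:                        # units must use distinct features
                members = m1 & m2
                if len(members) / n > delta:
                    units[(u1, u2)] = members
        return units

    scores = np.random.default_rng(1).random((30, 16))  # stand-in for a grade sheet
    one_d = dense_units_1d(scores)
    two_d = dense_units_2d(one_d, len(scores))
    print(len(one_d), "dense 1-D units,", len(two_d), "dense 2-D units")

Higher-dimensional candidate units are built in the same way from dense lower-dimensional ones, which is also where the exponential growth mentioned above comes from.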
Clustering Student Grade Sheets

We assume that datasets are in the following format: each row represents one student record, and each column measures one activity that students participate in during the course. An example is shown in Table 1, where dij denotes the ith student's performance score on the jth activity. Most clustering and subspace clustering algorithms allow dij to take real values.

          Activity 1   ......   Activity m
Stu 1     d11          ......   d1m
Stu 2     d21          ......   d2m
......    ......       ......   ......
Stu n     dn1          ......   dnm

Table 1. Dataset Format

Figure 1 shows three different kinds of clusters and subspace clusters that can be identified from such data using different clustering and subspace clustering algorithms. The properties of each type of cluster, as well as the process used to find it, are presented in the following subsections.

[Figure 1. Clusters on Student Activity Data]

The example dataset

We will use the grade sheet of a computer science service course as the example for this study. The dataset contains 30 students and 16 activities plus the final grade. The score of each activity, as well as the final grade, lies in the range [0, 1]. All students whose final composite grade is below 0.6 (60%) are marked as failing the course; by this standard, 7 of the 30 students are marked as failing. Activities 1 through 12 are weekly in-class labs in chronological order, activities 13 and 14 are two large projects due at mid-semester and at the end of the semester, and activities 15 and 16 are the mid-term and final examinations.

Each individual feature shows positive covariance with the final grade variable, as listed below. Several features are highly correlated with the final grade, such as activities 6, 8, 9, 10 and 11. The least predictive features include activities 1, 13 and 15. It is not surprising that the first lab (activity 1) is not a good indicator of a student's performance in the course, but it is interesting that the mid-term exam (activity 15) and the mid-term project (activity 13) are just as poor as the first lab at telling whether a student will pass the course.

Feature       Covariance     Feature       Covariance
Activity 1    .5044          Activity 9    .9075
Activity 2    .7758          Activity 10   .9089
Activity 3    .6097          Activity 11   .9067
Activity 4    .7415          Activity 12   .8137
Activity 5    .7125          Activity 13   .5283
Activity 6    .9787          Activity 14   .8427
Activity 7    .8670          Activity 15   .5613
Activity 8    .9065          Activity 16   .8046

Another interesting observation is that activity 6 (lab 6) alone can predict with 100% accuracy whether a student will pass the course. We later found out that the topic covered that week was loops, which is considered challenging for most students. This suggests that a student who grasps the loop structure well is also likely to pass the course as a whole. Therefore, it would be worthwhile for the instructor to spend more time and effort on this subject.
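This kind of screening is straightforward to reproduce. A minimal sketch, assuming the grade sheet is stored in a CSV file named grades.csv with columns Activity1 through Activity16 plus Final (placeholder names, not the actual file):

    # Sketch: covariance (and correlation) of every activity with the final grade.
    # "grades.csv" and the column names are illustrative placeholders.
    import pandas as pd

    grades = pd.read_csv("grades.csv")        # 30 rows, Activity1..Activity16, Final
    activities = [c for c in grades.columns if c.startswith("Activity")]

    cov_with_final = grades[activities].apply(lambda col: col.cov(grades["Final"]))
    corr_with_final = grades[activities].corrwith(grades["Final"])
    print(cov_with_final.sort_values(ascending=False))
    print(corr_with_final.sort_values(ascending=False))

Sorting the resulting series makes the strong and weak predictors easy to spot.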
Student clusters

Here we focus on identifying groups of students who demonstrate similar performance throughout the whole course. This type of cluster can help the instructor identify key activities that differentiate successful students from those who fail the course.

We have tried a wide variety of clustering algorithms available in Weka (Weka URL) on the example dataset, and the results show that the simple k-means algorithm achieves results at least comparable to other, more complicated algorithms in almost all cases.

Using the simple k-means algorithm, we started with k=2, that is, finding two clusters (cluster0 and cluster1) in this dataset. Cluster0 contains 6 of the 7 students who actually failed the course, and cluster1 contains 24 students, 23 of whom are marked as passing the course. One failing student is clustered into cluster1; this student's composite final score is .58, which lies right on the boundary of the passing/failing threshold. This suggests that choosing 0.6 as the passing/failing threshold is rather arbitrary.

Figure 2 shows the centroids of the two clusters. We can see that some activities differentiate the two clusters better than others, such as activities 6, 8, 9, 10 and 11. This result is consistent with the individual features' covariance with the final grade variable, suggesting that the clusters identified by the algorithm may have captured some real characteristics of the dataset.

We have also tried k=3 to find three clusters in this dataset. It resulted in cutting the failing cluster (cluster0) into two even smaller clusters, leaving cluster1 unchanged.

Features      Full Data (30)   cluster0 (6)   cluster1 (24)
===========================================================
Activity1     0.76             0.475          0.8313
Activity2     0.7533           0.3208         0.8615
Activity3     0.7908           0.3333         0.9052
Activity4     0.785            0.3333         0.8979
Activity5     0.815            0.3792         0.924
Activity6     0.7767           0.0708         0.9531
Activity7     0.79             0.3083         0.9104
Activity8     0.7983           0              0.9979
Activity9     0.7683           0              0.9604
Activity10    0.7308           0              0.9135
Activity11    0.7833           0.1667         0.9375
Activity12    0.7667           0.1667         0.9167
Activity13    0.7647           0.5767         0.8117
Activity14    0.667            0              0.8338
Activity15    0.7693           0.7844         0.7656
Activity16    0.5674           0.2278         0.6523

Figure 2. Centroids of Student Clusters (K = 2)
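A minimal sketch of this step, using scikit-learn's KMeans in place of Weka's SimpleKMeans and the same placeholder file and column names as before:

    # Sketch: two-cluster k-means on the activity scores, then inspect the centroids
    # and the pass/fail composition of each cluster. Names are placeholders.
    import pandas as pd
    from sklearn.cluster import KMeans

    grades = pd.read_csv("grades.csv")
    activities = [c for c in grades.columns if c.startswith("Activity")]

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(grades[activities])
    grades["cluster"] = km.labels_

    print(pd.DataFrame(km.cluster_centers_, columns=activities).round(3))
    print(pd.crosstab(grades["cluster"], grades["Final"] < 0.6))

The first print corresponds to the centroid table in Figure 2; the cross-tabulation shows how the two clusters line up with the 0.6 pass/fail threshold.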
Activity clusters

In this section we take a different view of the same dataset. Here we focus on finding groups of activities on which all students demonstrate similar performance patterns. For example, we may find a group of activities on which all students demonstrate consistently high performance, which suggests that these activities involve relatively easy-to-grasp concepts. We may also find a group of activities on which all students show worse than average performance, which suggests that the instructor may want to spend more time on these activities to cope with the difficulty.

To find this type of cluster, we need to transpose the original data matrix of Table 1 into the form shown in Table 2.

               Student 1   ......   Student n
Activity 1     d11         ......   dn1
Activity 2     d12         ......   dn2
......         ......      ......   ......
Activity m     d1m         ......   dnm

Table 2. Transposed Dataset

As above, we applied SimpleKMeans from Weka to the transposed example dataset. We tested five values of k, from 2 to 6, and tried to find the best value by looking at the within-group variance as a function of k. The result is shown in Figure 3. Four clusters appear to be the best choice because the slope of the curve flattens significantly after k=4.

[Figure 3. Within Group Variances]

Among the four clusters of activities, cluster3 is the most challenging group because its centroid is consistently lower than those of the other three clusters. Cluster3 contains three activities: activity 13, activity 15 and activity 16. Out of 30 students, 13 showed significantly lower than average performance on these three activities. Interestingly, in the previous sections we pointed out that activities 13 and 15 are also insignificant in differentiating passing students from failing ones. This suggests that these two activities may be too hard to be used as criteria for predicting a student's overall performance in the course. On the other hand, activity 1, which is likewise a poor feature for telling passing students from those who failed, might be too easy to be used as a criterion for that purpose.
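Both the transposition and the elbow-style search for k take only a few lines. A minimal sketch, again with the placeholder grades.csv:

    # Sketch: cluster activities instead of students by transposing the matrix,
    # and pick k by looking at the within-group variance (k-means inertia) curve.
    import pandas as pd
    from sklearn.cluster import KMeans

    grades = pd.read_csv("grades.csv")
    activities = [c for c in grades.columns if c.startswith("Activity")]

    transposed = grades[activities].T          # rows are now activities, columns students

    for k in range(2, 7):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(transposed)
        print(k, "clusters -> within-group variance:", round(km.inertia_, 3))

Plotting the printed values against k reproduces the kind of curve shown in Figure 3, and the knee of that curve suggests how many activity clusters to keep.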
Subspace clusters

In this section, we show that subspace clustering algorithms can be used to find clusters embedded in subspaces. In the earlier sections we looked at student clusters, which contain groups of students who demonstrate similar performance throughout the whole course. Here we relax that constraint and allow any group of students who demonstrate similar learning patterns in any subset of activities to become a candidate cluster.

We first show the results from a partition-based subspace clustering algorithm. We chose PROCLUS because it reports clusters in axis-parallel subspaces, which makes the final interpretation of the clusters easier. The PROCLUS implementation is from the open source subspace clustering package (OpenCluster URL).

Similar to k-means, PROCLUS needs a pre-determined number of clusters (k) before running. In addition, it requires the average subspace dimensionality (l). We set k=4, tried several values of l between 2 and 5, and found that the results are highly similar in all cases. For the example dataset, PROCLUS finds the following four clusters when we set k = 4 and l = 3:

SC_0: [0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0] #13 {2 5 7 8 10 13 14 15 17 21 23 27 29}
SC_1: [0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0] #11 {0 1 4 12 16 19 20 24 25 26 28}
SC_2: [0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0] #4  {3 6 9 22}
SC_3: [0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0] #2  {11 18}

Each line describes one subspace cluster. For example, the first subspace cluster (SC_0) lies in a subspace that contains two features, activity 8 and activity 12, and contains 13 students: stu2, stu5, stu7, and so on. A simple investigation shows that SC_2 and SC_3 contain all failing students. In SC_2, 4 of the 6 students who fail the class have difficulty with activity 6 and activity 7, and SC_3 shows that the other two failing students had difficulty with activity 2 and activities 8, 9 and 10. We later found out that activity 6 was a lab on the loop structure, and labs 8 and 9 are labs on classes and objects. This suggests that the majority of the students who failed this course started to fall behind when loops were introduced, while the rest failed to catch up when the concepts of object-oriented programming were introduced. Therefore, the instructor may want to spend extra time helping students complete these three activities. We can also see that SC_1 and SC_3 are the two clusters best contrasted by activities 6, 7 and 8. Since all students in SC_1 passed the course while the SC_3 students failed, these three activities may be crucial for passing the course.

We have also tried grid-based subspace clustering algorithms on the sample data. Grid-based algorithms produce more than a thousand subspace clusters, and the large number of reported clusters makes their interpretation very difficult. We will look into the possibility of pruning insignificant clusters based on domain knowledge; similar research has been done in bio-medical data analysis, where domain knowledge is used to measure the significance of each bi-cluster.
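The bit-mask lines above are compact but easy to misread. A small helper such as the following (written for the line format shown above; it is not part of the OpenSubspace package) decodes each cluster into activity names and student indices:

    # Sketch: decode a subspace-cluster line of the form
    #   "SC_0: [0 0 ... 1 ...] #13 {2 5 7 ...}"
    # into the activities spanning the subspace and the member students.
    import re

    def decode(line):
        name, rest = line.split(":", 1)
        mask = [int(b) for b in re.search(r"\[([^\]]*)\]", rest).group(1).split()]
        members = [int(s) for s in re.search(r"\{([^}]*)\}", rest).group(1).split()]
        activities = [f"activity {i + 1}" for i, bit in enumerate(mask) if bit == 1]
        return name.strip(), activities, members

    line = "SC_0: [0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0] #13 {2 5 7 8 10 13 14 15 17 21 23 27 29}"
    print(decode(line))
    # ('SC_0', ['activity 8', 'activity 12'], [2, 5, 7, 8, 10, 13, 14, 15, 17, 21, 23, 27, 29])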
Comparisons between the three

Student clusters represent groups of students showing similar performance patterns throughout the whole course, while subspace clusters are groups of students who demonstrate similar performance on subsets of activities. Activity clusters are helpful for finding tasks that are difficult for all students, while subspace clusters can identify subsets of activities that challenge different groups of students. Since not all students experience the same difficulty with all activities, subspace clustering seems well suited for this purpose. We can see from the example data that activity 6 may be a good feature for explaining why some students failed the course, but it is not the only indicator: SC_3 suggests that some students had no problem finishing activity 6 but still failed the course due to their unsatisfactory performance on activities 8, 9 and 10.

Conclusions and Future Work

This paper is our first attempt to apply a rich collection of subspace clustering algorithms to educational data. Our preliminary results show that clustering and subspace clustering techniques can be used on high dimensional educational data to find interesting student learning patterns. These cluster patterns help the instructor gain insight into different learning behaviors and adapt the course to accommodate various students' needs. We will test and validate all of the presented clustering schemes on more educational data of larger size. We will also look into the possibility of applying grid-based subspace clustering algorithms to educational data guided by domain knowledge.

References

Aggarwal C. C., Wolf J. L., Yu P. S., Procopiuc C., and Park J. S. Fast Algorithms for Projected Clustering. Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD'99), pp. 61-72, 1999.

Aggarwal C. C., and Yu P. S. Finding Generalized Projected Clusters in High Dimensional Spaces. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD'00), pp. 70-81, 2000.

Agrawal R., Gehrke J., Gunopulos D., and Raghavan P. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD'98), pp. 94-105, 1998.

Bravo J., and Ortigosa A. Detecting Symptoms of Low Performance Using Production Rules. Proceedings of the Second International Conference on Educational Data Mining, pp. 31-40, 2009.

Cheng C. H., Fu A. W.-C., and Zhang Y. Entropy-based Subspace Clustering for Mining Numerical Data. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'99), pp. 84-93, 1999.

Cheng Y., and Church G. M. Biclustering of Expression Data. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB'00), pp. 93-103, 2000.

Dekker G. W., Pechenizkiy M., and Vleeshouwers J. M. Predicting Students Drop Out: A Case Study. Proceedings of the Second International Conference on Educational Data Mining, pp. 41-50, 2009.

Kriegel H.-P., Kröger P., and Zimek A. Clustering High-dimensional Data: A Survey on Subspace Clustering, Pattern-based Clustering, and Correlation Clustering. ACM Transactions on Knowledge Discovery from Data (TKDD), Vol. 3, 2009.

Madeira S., and Oliveira A. Biclustering Algorithms for Biological Data Analysis: A Survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 1, pp. 24-45, 2004.

Merceron A., and Yacef K. Interestingness Measures for Association Rules in Educational Data. Proceedings of the Second International Conference on Educational Data Mining, pp. 57-68, 2009.

Nugent R., Ayers E., and Dean N. Conditional Subspace Clustering of Skill Mastery: Identifying Skills that Separate Students. Proceedings of the Second International Conference on Educational Data Mining, pp. 101-110, 2009.

OpenClusters: http://dme.rwth-aachen.de/OpenSubspace/

Weka: http://www.cs.waikato.ac.nz/ml/weka