Student Behavioral Embeddings and Their Relationship to Outcomes in a Collaborative Online Course

Renzhe Yu (UC Irvine, renzhey@uci.edu), Zachary Pardos (UC Berkeley, pardos@berkeley.edu), John Scott (UC Berkeley, jmscott212@berkeley.edu)

ABSTRACT
In online collaborative learning environments, prior work has found moderate success in correlating behaviors to learning after passing them through the lens of human knowledge (e.g., hand-labeled content taxonomies). However, these manual approaches may not be cost-effective for triggering in-time support, especially given the complexity of interpersonal and temporal behavioral patterns under rich interactions. In this paper, we test the hypothesis that a neural embedding of students that synthesizes their event-level course behaviors, without hand labels or knowledge about the specific course design, can be used to make predictions of desired outcomes and thus inform intelligent support at scale. While our student representations predicted student interactivity (i.e., sociality) measures, they failed to predict course grades and grade improvement better than a naive baseline. We reflect on this result as a data point added to the nascent trend of raw student behaviors (e.g., clickstream) proving difficult to directly correlate to learning outcomes and discuss the implications for big education data modeling.

Keywords
Collaborative learning environment, neural embedding, skip-gram, online course, higher education, behavior, predictive modeling

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. INTRODUCTION
Representation of collaborative learning behaviors in their raw formats has been challenging due to their complicated internal dependencies. Theory-driven approaches can extract some conceptually important measures of these learning processes but might not give good grounds for real-time learner support due to the human effort required. In this paper, we examine an aggregate, unsupervised representation of these collaborative learning behaviors in the context of a formal course that features sharing, remixing and interacting with student artifacts. We use a connectionist, neural network approach to representing a student as a function of a co-interaction network temporally formed by peers interacting in different ways in different weeks of the course. In reflection of the prior empirical work, we test the correspondence of these representations to learning outcomes. First, we investigate whether the sociality of a student, or how much she is involved in the collaborative community, can be predicted from these low-level behavioral representations, as this is a direct goal of the special course design we analyze. Second, given the moderate relationship between interpersonal connections and learning performance in the literature, we test whether these vector representations are indicative of students' final course performance. This exploration has strong pedagogical implications because an unsupervised student-level representation that captures signals of effective learning can be further deployed in intelligent systems to give just-in-time feedback and interventions in the face of interconnected behavioral streams.

1.1 Collaborative learning behavior and outcomes
Generations of learning theories and pedagogies have highlighted the benefits of social processes for effective learning [15, 13]. Accordingly, there has been a multitude of studies that characterize these processes and examine how they relate to learning outcomes from granular learning behavior data [2]. One typical context of these studies is collaborative learning environments where students are required to work together in one way or another. As the interpersonal and temporal dependencies complicate the social processes, multiple methodological paradigms have been adopted to represent students' collaborative learning behavior.

To model the structures of interpersonal connections, social network analysis (SNA) conceptualizes learners as nodes and their various formats of interaction as edges, and typically identifies global or local structures. Some studies concentrate on the discovery of global structures such as core-periphery structures [6] and cohesive groups [3], while a number of others take more local perspectives and find predictive power of network positions for learning outcomes [1, 5]. An alternative paradigm is the extension of psychometric or knowledge tracing models to collaborative settings, where collaboration status or group membership information is used to construct additional terms in the original functions [16, 9]. These adapted models have shown improved predictive power for students' learning performance.

The approaches above represent students' collaborative learning behaviors via theory-based or human-engineered models, and each captures some dimensions that are predictive of various learning outcomes. At the same time, they run the risk of misspecifying the model form and leaving some of the behavioral signals unattended, compared to more bottom-up, data-driven methods.

1.2 Connectionist student representation
In domains where it is difficult to enumerate and give values to features that satisfactorily represent the items, distributional approaches to modeling them might be useful. For example, the meaning of words in a lexicon is socially mediated and does not lend itself well to description through feature engineering. Thus, the connectionist representation approach, which uses neural networks to learn a continuous feature vector representing all of the contexts of a word in a corpus, has become popular [8]. Similar challenges are present when it comes to positioning students based on their fine-grained behavior in open-ended learning environments. In response, recent research has attempted to apply connectionist models to learn a continuous vector of a student which represents all the contexts of her raw behaviors. For example, sequences of student responses in intelligent tutoring systems or student actions in MOOCs have been used to map students to continuous vector spaces [14, 11]. Co-enrollment sequences with other students, although not in a micro-level learning context, have been used to represent undergraduate students throughout their degree [7]. While low-level behavioral embeddings have been used in non-social contexts to predict student performance, applying these techniques to collaborative settings may offer further insight into the complicated social processes.
2. DATASET
In this study, we analyzed a fully online course offered to residential students at a four-year public university in the United States. The course was focused on sociocultural aspects of literacy and global education. To facilitate collaborative learning, the course design featured a number of activities related to sharing, discussing, remixing and composing media with peer students. These activities were enabled by SuiteC, a toolkit that was integrated into the Canvas learning management system (LMS) [4]. There were three main components of SuiteC:

• Asset Library is a social platform where students contribute and share various media content in the form of "assets," and interact with peer assets by viewing, liking and commenting on them. Figure 1a shows the gateway page of the Asset Library with the feed of recently contributed assets.

• Whiteboards is an authoring tool that allows students to work individually or collaboratively on designing multimedia artifacts. Students can import assets as whiteboard elements and export finished whiteboards as assets for peer interaction. Figure 1b illustrates the interface when students collaborate on a whiteboard.

• Engagement Index is a gamification tool that tracks and evaluates student engagement in the SuiteC tools and provides a leaderboard for social comparison.

Figure 1: SuiteC components. (a) Gateway page of the Asset Library. (b) Interface of Whiteboard collaboration.

The course lasted for 14 weeks in Spring 2016. Each week except for the spring break, students worked through five activity phases that involved sharing, commenting, and creating assets and whiteboards, organized under course hashtags that students included in their posts. These SuiteC activities accounted for 25% of the final grade, whereas another 55% came from two long-form written papers that required students to integrate course readings. These two major assessments occurred around the middle and the end of the semester, respectively. The remaining 20% of the course grade consisted of eight ethnographic field notes authored by students on their site visits.

We acquired all the time-stamped click events within SuiteC for this course, a total of 684,095 entries. Each entry recorded a granular action that a student performed on the foregoing tools, e.g., view an asset or add an element to a whiteboard. Attributes of the action included event type, timestamp, associated asset/whiteboard id, anonymized user id, and user role, among others. After removing events that fell outside the normal period of the semester or that were not triggered by a student, we kept 658,967 entries for our analysis. In addition, the gradebook, which contained scores for the two major assessments and the final course grades, was also available.

3. METHODS
3.1 Student representation using the skip-gram model
In this section, we describe our methodological approach to unsupervised student feature learning by way of neural embeddings. We model our student representation after [8], who used a neural network architecture called a skip-gram to learn word representations from their context distributions in a corpus. Given a word sequence $\{w_1, w_2, \ldots, w_T\}$, this model maximizes the average log probability of contextual words:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t) \qquad (1)$$

where $c$ is the contextual window size and the conditional probability $p(w_{t+j} \mid w_t)$ is computed using a softmax function over all possible words in the corpus for each given $w_t$. Because words that share meanings are more likely to occur in similar contexts, the word vectors of synonyms learned via this model should be in proximity in the high-dimensional space. Moreover, the learnt word vectors encode semantic relationships as interesting yet simple mathematical properties. For example, $v_{Paris}$ is closest to $v_{Berlin} - v_{Germany} + v_{France}$. This simplicity is why we are particularly interested in whether this technique can similarly characterize students from their complicated collaborative learning behaviors, thus facilitating easy identification of targeted actions (e.g., pairing students that sum up to a "beacon").
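Equation (1) can be made concrete with a minimal NumPy sketch (the toy corpus, matrix names, and sizes below are ours for illustration, not from the paper): two randomly initialized embedding matrices define a softmax distribution $p(w_{t+j} \mid w_t)$, and the objective is the average log probability of context tokens within a window, which training would maximize.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, c = 10, 4, 2               # vocabulary size, embedding size, window size (toy values)
W_in = rng.normal(size=(V, d))   # "input" embedding per token (the learned representation)
W_out = rng.normal(size=(V, d))  # "output" (context) embedding per token

def log_p(context, center):
    """log p(w_context | w_center) via a full softmax over the vocabulary."""
    scores = W_out @ W_in[center]   # dot product of the center vector with every output vector
    scores = scores - scores.max()  # shift for numerical stability
    return scores[context] - np.log(np.exp(scores).sum())

def skipgram_objective(seq):
    """Average log probability of Eq. (1) for one token sequence."""
    T = len(seq)
    total = 0.0
    for t in range(T):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < T:
                total += log_p(seq[t + j], seq[t])
    return total / T

print(skipgram_objective([1, 3, 5, 3, 2, 1]))  # a negative average log probability
```

A real implementation would replace the full softmax with hierarchical softmax or negative sampling as in [8]; this sketch only shows what quantity the gradient updates push upward.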
In our implementation, student sequences are constructed according to their order of appearance in the raw clickstream events sorted by time. We construct separate sequences for each week because, with the weekly course design, procrastinators for Week 1 and early birds for Week 2, although chronologically adjacent, may not share common traits. Moreover, as different event types in the raw dataset may or may not represent distinct behavioral signals, we experiment with three approaches to grouping raw event types:

• Raw event type grouping: There are 38 unique values in the "event" column of the raw data set, depicting the action that a student takes (e.g., create asset comment). We construct separate weekly sequences for each of these values and feed all resulting sequences to the model.

• Instructor coding grouping: We ask the instructor to group the 38 events based on their perceived nature to more accurately capture the kind of participation represented by a specific event. This process produces 15 groups, where each event belongs to one group only. For example, when students are authoring a Whiteboard, 9 different events could be triggered as they add shapes, assets, and free-hand drawing elements to their canvas, so all nine events are categorized as "Whiteboard Composing." We separately construct weekly sequences for each group and feed all sequences to the model.

• No grouping: We do not differentiate event types and simply construct weekly sequences from the entire dataset.

Table 1: Example of student tokenization for connectionist representation (student2vec)

(a) Raw clickstream data table
Timestamp | Week | Event | Student ID
2/22 23:19 | 3 | View asset | 101
2/23 13:12 | 3 | Create whiteboard | 104
2/25 21:23 | 3 | View asset | 102
2/26 12:10 | 3 | Create whiteboard | 102
2/27 14:27 | 3 | Create whiteboard | 104
2/27 15:03 | 3 | View asset | 103
2/28 13:08 | 4 | Create whiteboard | 102
3/1 15:27 | 4 | View asset | 103
3/2 16:04 | 4 | Create whiteboard | 101
3/2 21:21 | 4 | Create whiteboard | 104
3/3 15:23 | 4 | View asset | 101
3/5 12:13 | 4 | View asset | 102

(b) Student sequences as input to the student2vec model
Event × week | Student ID sequence
View asset, Week 3 | 101, 102, 103
Create whiteboard, Week 3 | 104, 102, 104
View asset, Week 4 | 103, 101, 102
Create whiteboard, Week 4 | 102, 101, 104

Table 1 gives a generalized example of our approach. In the "raw event" approach, the "event" column contains the original event name in the dataset. For "instructor coding", that column is the group that the raw event belongs to. The "no grouping" approach, however, treats the column as if it held the same value for all entries in the table. Whenever a student appears two or more times consecutively in a sequence, we remove the duplicate occurrence(s). In the remainder of this paper, we refer to this representation approach as student2vec. As for the hyperparameters of the model, we search over {8, 32, 64} for the vector size and {5, 20, 40} for the contextual window size, and plot all the results in Section 4.2.
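The tokenization illustrated in Table 1 can be sketched with a short stdlib-only snippet (the row layout and function name are ours; the paper does not specify its implementation): group events by (event type, week), keep students in timestamp order, and collapse consecutive repeats of the same student.

```python
from collections import defaultdict

# Rows from Table 1a, already sorted by timestamp: (week, event, student_id)
rows = [
    (3, "View asset", 101), (3, "Create whiteboard", 104),
    (3, "View asset", 102), (3, "Create whiteboard", 102),
    (3, "Create whiteboard", 104), (3, "View asset", 103),
    (4, "Create whiteboard", 102), (4, "View asset", 103),
    (4, "Create whiteboard", 101), (4, "Create whiteboard", 104),
    (4, "View asset", 101), (4, "View asset", 102),
]

def build_sequences(rows):
    """One student-ID sequence per (event group, week), collapsing consecutive repeats."""
    seqs = defaultdict(list)
    for week, event, student in rows:
        key = (event, week)
        if not seqs[key] or seqs[key][-1] != student:  # remove consecutive duplicates
            seqs[key].append(student)
    return dict(seqs)

sequences = build_sequences(rows)
print(sequences[("View asset", 3)])  # [101, 102, 103]
```

The resulting lists, with IDs cast to string tokens, could then be fed as "sentences" to any skip-gram implementation, for instance gensim's `Word2Vec(sentences, sg=1, vector_size=8, window=20, min_count=1)`; gensim is our illustrative choice here, not a tool named by the paper.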
3.2 Predicting sociality and learning outcome measures
We are interested in how well the unsupervised student2vec representations capture signals of students' social learning behavior, as they conceptually should. Thus, we test the ability of these student vectors to predict an array of human-engineered measures of learning. We use predictive modeling as a more formal alternative to qualitatively examining algebraic properties of these vectors or looking at whether they exhibit meaningful clusters with respect to the learning measures.

First, we collaborate with the instructor (the third author on this paper) and construct four metrics of sociality (tendency to engage in interactive activities) for each student:

• median asset popularity: across all assets that a student (co-)creates throughout the semester, the median of their popularity values, where the popularity of an asset is defined as the number of unique non-author students who interact with it

• total asset popularity: across all assets that a student (co-)creates throughout the semester, the sum of their popularity values

• count asset authored: the total number of assets that a student (co-)creates throughout the semester

• count peer asset visited: the total number of assets that a student interacts with of which she is not an author

The first two variables measure popularity, or "passive" processes of socialization, while the latter two capture "active" processes. All four variables are calculated from asset-related logged events, which are a small fraction (~5%) of all recorded activities.

We further look at course grades as reflected in formal assessments, including the following variables:

• final score: the final grade in the gradebook, out of 100

• grade gain: the difference between the scores of the second (final paper) and the first (midterm paper) assessment
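The four sociality metrics above could be computed from asset-level logs roughly as below (a stdlib-only sketch; the toy authorship/interaction tables and helper names are ours, and the paper's actual event schema is richer):

```python
from statistics import median

# Toy data for illustration: asset_id -> set of (co-)authors, and (student, asset) interactions
authors = {
    "a1": {101}, "a2": {101, 102}, "a3": {103},
}
interactions = [  # view/like/comment events
    (102, "a1"), (103, "a1"), (103, "a1"), (101, "a3"),
    (104, "a2"), (101, "a2"),
]

def popularity(asset):
    """Number of unique non-author students who interact with the asset."""
    return len({s for s, a in interactions if a == asset and s not in authors[asset]})

def sociality(student):
    own = [a for a, auth in authors.items() if student in auth]
    pops = [popularity(a) for a in own]
    visited = {a for s, a in interactions if s == student and student not in authors[a]}
    return {
        "median_asset_popularity": median(pops) if pops else None,
        "total_asset_popularity": sum(pops),
        "count_asset_authored": len(own),
        "count_peer_asset_visited": len(visited),
    }

print(sociality(101))
```

For student 101 this yields a median popularity of 1.5 over the two authored assets, a total popularity of 3, and one peer asset visited; students with no authored assets get a missing median, mirroring the missing values discussed in Section 4.1.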
We then build models to predict these six measures using the learned student vectors. Because the number of data points is much smaller than in typical deep learning applications, we implement two simple models: linear regression and a feed-forward neural network with a single layer of 8 neurons. Each dimension of the student vector serves as a feature in the model input. As the magnitude of these vectors might correlate with students' number of occurrences and hence with sociality measures, we standardize them to unit length before feeding them into the model. For each target measure, only students with valid values are included in the model training and testing processes. Four-fold cross-validation is performed for all the models, and in each fold, 20% of the training data is used as the validation set during the training process to avoid overfitting.
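The evaluation setup for the regression arm can be sketched as follows (NumPy only, with synthetic vectors and targets of our own making; the neural-network model and the 20% validation split are omitted for brevity): unit-length normalization, four-fold cross-validation, and a mean-of-training-sample baseline.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 8))                        # toy "student2vec" vectors: 80 students, 8 dims
X = X / np.linalg.norm(X, axis=1, keepdims=True)    # standardize each vector to unit length
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=80)  # toy target with genuine signal in dim 0

def rmse(pred, truth):
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

def four_fold_cv(X, y):
    """Mean test RMSE of OLS regression vs. a mean-of-training-sample baseline."""
    folds = np.array_split(np.arange(len(y)), 4)
    model_err, base_err = [], []
    for k in range(4):
        test = folds[k]
        train = np.concatenate([folds[i] for i in range(4) if i != k])
        Xtr = np.hstack([X[train], np.ones((len(train), 1))])  # add intercept column
        beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        Xte = np.hstack([X[test], np.ones((len(test), 1))])
        model_err.append(rmse(Xte @ beta, y[test]))
        base_err.append(rmse(np.full(len(test), y[train].mean()), y[test]))
    return float(np.mean(model_err)), float(np.mean(base_err))

m, b = four_fold_cv(X, y)
print(m < b)  # with real signal in the vectors, regression beats the naive baseline
```

On the synthetic data the regression easily beats the baseline because the target is constructed from the vectors; the paper's finding is precisely that this does not happen for course grades on real data.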
4. RESULTS
4.1 Descriptive analysis
A summary of the basic statistics of the six variables we use as prediction targets, plus the scores for the two assessments, is shown in Table 2. The inconsistent numbers of observations reflect missing values in some of the variables. The last two measures of students' activity have valid values for all 114 students appearing in the dataset. Among them, 15 students did not author any asset throughout the course and therefore have missing values for the two popularity measures. Moreover, only 79 students finished the course with grades.

All three course grades average 85-90 points with a standard deviation of around 6 points (out of 100). Also, the difference between median and mean is small for all three, suggesting relatively symmetric distributions. A student authored on average 4.8 assets a week (62.11 in total), which aligns with the weekly course requirements. Each of these assets had around 2.4 peer visitors (150.36 in total). If we recalculate these measures only among students who received grades, the average number of authored assets goes up to 6.5, the popularity per asset remains similar at 2.1 peers, and the standard deviation of both measures shrinks substantially due to the removal of a large number of zero values (not reported here).

4.2 Predictive analysis
We first examine Spearman's rank correlations between course performance and sociality measures. Figure 2 depicts the correlation matrix in graphical terms. The three course grades are moderately to highly correlated with each other, all statistically significant at the 0.1 level (upper-left quadrant). The correlation between sociality and performance is more complicated (lower-left quadrant): in more cases the correlation is weak or insignificant, but two sociality measures (the number and the total popularity of assets authored) and two final outcomes (paper and course total) have moderate to high correlations. Lastly, the four sociality measures are mostly significantly correlated with each other, with low to moderate magnitudes (lower-right quadrant).

Figure 2: Rank correlation between learning outcome measures (first four rows/columns) and student interaction measures (last four rows/columns), with statistically insignificant correlations (p > 0.1) crossed out.

We illustrate the prediction performance by target variable in Figure 3. In each model configuration, root mean squared error (RMSE) is used as the evaluation metric on testing results. We define a naive baseline where the mean value of the training sample is used as the predicted value in each fold. To evaluate the performance of a model in relation to this baseline, we calculate the percentage of improvement over the baseline:

$$\%\Delta RMSE = \frac{RMSE_{baseline} - RMSE_{model}}{RMSE_{baseline}} \qquad (2)$$
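Equation (2) is a one-line computation; the sketch below (function name ours) checks it against two rows of Table 3:

```python
def pct_delta_rmse(rmse_baseline, rmse_model):
    """Eq. (2): fractional RMSE improvement of a model over the naive mean baseline."""
    return (rmse_baseline - rmse_model) / rmse_baseline

# Table 3, count asset authored: baseline RMSE 34.68, neural net RMSE 27.33
print(round(100 * pct_delta_rmse(34.68, 27.33), 2))  # 21.19, matching the "% improved" column
```

A negative value means the model did worse than predicting the training mean, which is exactly the situation reported for the course-grade targets.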
Each histogram in Figure 3 depicts the %ΔRMSE across different combinations of student2vec hyperparameters, including vector size, contextual window size and event grouping (each combination referred to as a "case"). This approach to presenting results allows for a high-level view of the predictive power of this student representation approach.

Table 2: Descriptive statistics of main variables
Variable | N | mean | std | min | median | max
midterm paper | 79 | 88.21 | 6.18 | 72 | 88 | 100
final paper | 79 | 85.94 | 6.10 | 65 | 86 | 98.67
final score | 79 | 87.82 | 6.35 | 69.27 | 89.06 | 98.86
grade gain | 79 | -2.27 | 6.14 | -16 | -2.33 | 15
median asset popularity | 99 | 1.52 | 1.48 | 0 | 1 | 10.5
total asset popularity | 99 | 150.36 | 91.92 | 0 | 148 | 502
count asset authored | 114 | 62.11 | 39.96 | 0 | 74.5 | 148
count peer asset visited | 114 | 154.46 | 134.74 | 0 | 132 | 534

Table 3: Summary of prediction error (RMSE) using the best-performing student2vec representation (vector size: 8; context window size: 20; event grouping: instructor's coding)
Target | Baseline | Neural net (% improved) | Regression (% improved)
median asset popularity | 1.48 | 1.45 (2.26) | 1.50 (-1.08)
total asset popularity | 91.51 | 78.40 (14.33) | 80.27 (12.28)
count asset authored | 34.68 | 27.33 (21.19) | 27.57 (20.50)
count peer asset visited | 129.90 | 119.36 (8.11) | 113.97 (12.26)
final score | 6.39 | 6.39 (0.06) | 8.01 (-25.40)
grade gain | 6.12 | 6.41 (-4.80) | 6.30 (-2.94)

Figure 3: Histograms of prediction results for different target variables using student2vec representations: (a) sociality prediction targets; (b) learning outcome prediction targets. Each graph illustrates the performance of predicting the variable in its title across different combinations of model hyperparameters (i.e., "cases" on the y-axis).

Figure 3a suggests that student2vec has moderate predictive power for the sociality measures, especially the total amount of popularity a student gains and the number of assets a student authors, where it beats the baseline by 12% on average. By contrast, Figure 3b shows a complete failure of these student representations to predict learning outcomes: in most cases the prediction performance is worse than that of the naive baseline. These results suggest that the connectionist representation can, at least, extract low-level behavioral signals that relate to social processes, but not those that contribute to performance.

Finally, we qualitatively compare the performance of different cases. Across the three event grouping approaches, instructor coding produces performance similar to raw event grouping, while both generally perform better than no grouping. To give an example of the best results, we select the vector size of 8 and context window size of 20 coupled with the instructor's coding, and report the detailed performance metrics associated with the different prediction targets in Table 3. With regard to sociality measures, student2vec can improve the baseline RMSE by 8-21%, except for median asset popularity. In predicting course outcomes, however, this student representation performs 0.06% better than the baseline at best (6.39 vs. 6.39 for final score).

5. DISCUSSIONS AND CONCLUSIONS
Granular learning process data in online learning environments afford the possibility of real-time personalized learner support by way of detecting behavioral signals of unsuccessful learning. However, the correspondence between low-level student actions and their performance on assessments, outside of social pedagogies, has been a tenuous one, challenging this possibility in the wild. In the context of an edX MOOC, it was found that the addition of video viewing and other passive learning activity information did not improve prediction of future assessment performance beyond what past assessment performance alone achieved [10]. This result re-emerged in a college-level chemistry tutor setting, where past assessment performance alone predicted future assessment performance as well as or better than when mixed with detailed eye-tracking telemetry [12].

The analyses presented in this paper reveal similar challenges, yet some opportunity, for using student clickstream data from a mostly collaborative course to predict learning outcomes. We found that our representations of students, summarized from their low-level behaviors of sharing, creating, and socializing around artifacts, did correspond to human-engineered sociality measures, but not to assessed performance in the course as much as a naive baseline. Given our relatively low magnitude of data, an exceptionally high prediction accuracy was not expected, and the results may be seen as a lower bound of these representations' predictive power. However, their null relationship with summative assessment results still serves as another data point suggesting the difficulty of linking raw behavior, absent prior grade information, with assessment performance.

On the other hand, the model was able to predict measures of students' interactivity above baselines, and these manually engineered measures do not consistently predict course performance. These observations suggest that vector representations in general might not be the culprit. A similar methodology for representing undergraduate students also predicted on-time graduation with over 90% accuracy [7], an improvement over their baseline. These mixed results nudge us to reflect on the roles of data-driven behavioral representations and theory-based feature engineering [5, 9, 17] in building useful predictive models of student learning (and thus, support systems) in the context of collaborative learning. It is perhaps not enough to learn representations of students based on behavior without a more careful dissection of the nature of the behavior. This takeaway parallels the observation in EDM that refined knowledge component modeling is often necessary to accurately estimate cognitive mastery.

Nevertheless, it was a natural expectation, in our data-driven approach, that students who are similar in terms of when and what they do would also be similar in their course outcomes. Although this turned out not to be the case in the instance we examined, it remains an open question for learning science researchers to consider whether this is merely an anomaly or part of a greater lesson to be learned about effective ways to fit behavior into the learner process tracing picture. For our research, a combination of interpretable activity representation and the current embedding approach may be tested in the future to gain insights into the mechanism of interaction between the two in the learning process.

6. REFERENCES
[1] H. Cho, G. Gay, B. Davidson, and A. Ingraffea. Social networks, communication styles, and learning performance in a CSCL community. Computers & Education, 49(2):309–329, 2007.
[2] R. Ferguson and S. B. Shum. Social Learning Analytics: Five Approaches. In Proceedings of the 2nd International Conference on Learning Analytics and Knowledge, pages 23–33, Vancouver, BC, Canada, 2012. ACM Press.
[3] N. Gillani and R. Eynon. Communication patterns in massively open online courses. Internet and Higher Education, 23:18–26, 2014.
[4] S. M. Jayaprakash, J. M. Scott, and P. Kerschen. Connectivist Learning Using SuiteC - Create, Connect, Collaborate, Compete! In Practitioner Track Proceedings of the 7th International Learning Analytics & Knowledge Conference, pages 69–76, Vancouver, BC, Canada, 2017.
[5] S. Joksimović, A. Manataki, D. Gašević, S. Dawson, V. Kovanović, and I. F. de Kereki. Translating network position into performance: Importance of centrality in different network configurations. In Proceedings of the Sixth International Conference on Learning Analytics & Knowledge, pages 314–323, Edinburgh, United Kingdom, 2016. ACM.
[6] S. B. Kellogg, S. Booth, and K. M. Oliver. A Social Network Perspective on Peer Support Learning in MOOCs for Educators. International Review of Research in Open and Distance Learning, 15(5):263–289, 2014.
[7] Y. Luo and Z. A. Pardos. Diagnosing University Student Subject Proficiency and Predicting Degree Completion in Vector Space. In Proceedings of the Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, New Orleans, LA, USA, 2018.
[8] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, pages 3111–3119. Curran Associates Inc., 2013.
[9] J. K. Olsen, V. Aleven, and N. Rummel. Predicting Student Performance in a Collaborative Learning Environment. In Proceedings of the 8th International Conference on Educational Data Mining (EDM), 2015.
[10] Z. A. Pardos, Y. Bergner, D. T. Seaton, and D. E. Pritchard. Adapting Bayesian Knowledge Tracing to a Massive Open Online Course in edX. In Proceedings of the 6th International Conference on Educational Data Mining (EDM), pages 137–144, Memphis, TN, 2013.
[11] C. Piech, J. Bassen, J. Huang, S. Ganguli, M. Sahami, L. J. Guibas, and J. Sohl-Dickstein. Deep Knowledge Tracing. In Advances in Neural Information Processing Systems, pages 505–513, 2015.
[12] M. A. Rau and Z. Pardos. Adding eye-tracking AOI data to models of representation skills does not improve prediction accuracy. In Proceedings of the 9th International Conference on Educational Data Mining, pages 622–623, 2016.
[13] G. Siemens. Connectivism: A Learning Theory for the Digital Age. International Journal of Instructional Technology and Distance Learning, 2(1):1–7, 2005.
[14] M. Teruel and L. A. Alemany. Co-embeddings for Student Modeling in Virtual Learning Environments. In Proceedings of the 26th Conference on User Modeling, Adaptation and Personalization, pages 73–80, Singapore, 2018. ACM Press.
[15] L. S. Vygotsky. Interaction between Learning and Development. In Mind in Society: Development of Higher Psychological Processes, pages 71–91. Harvard University Press, Cambridge, MA, USA, 1978.
[16] M. Wilson, P. Gochyyev, and K. Scalise. Modeling Data From Collaborative Assessments: Learning in Digital Interactive Social Networks. Journal of Educational Measurement, 54(1):85–102, Feb. 2017.
[17] D. Yang, M. Wen, and C. Rose. Weakly Supervised Role Identification in Teamwork Interactions. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1671–1680, Stroudsburg, PA, USA, 2015.