Predicting Peer-Review Participation at Large Scale Using an Ensemble Learning Method

Erkan Er, Eduardo Gómez-Sánchez, Miguel L. Bote-Lorenzo, Yannis Dimitriadis, Juan I. Asensio-Pérez
GSIC/EMIC, Universidad de Valladolid, Valladolid, Spain.
erkan@gsic.uva.es, {edugom|migbot|yannis|juaase}@tel.uva.es

Abstract. Peer review has been an effective approach for the assessment of massive numbers of student artefacts in MOOCs. However, low student participation is a barrier that can cause inefficiencies in the implementation of peer reviews, disrupting student learning. In this regard, knowing in advance an estimate of the number of peer works that students will review may bring numerous pedagogical benefits in MOOCs. Previously, we attempted to predict student participation in peer review in a MOOC context. Building on that work, in this study we propose an ensemble learning approach with a refined set of features. Results show that the prediction performance improves when a preceding classification model is trained to identify students with no peer-review participation, and that the refined features were effective and more transferable to other contexts.

Keywords: MOOC · Peer review · Engagement prediction · Ensemble learning

1 Introduction

Peer review (or peer assessment), in which an equal-status student assesses a peer's work [1], has been a solution to the evaluation of thousands of student artefacts (e.g., essays) in MOOCs [2]. However, this solution itself brings some practical challenges at large scale, one of which is low student participation [3]. Given that MOOC participants have different goals and come from diverse backgrounds, their participation in peer reviews might not be persistent [4]. With low participation rates, a peer-review activity might yield various issues. For example, submissions of striving students may receive neither feedback nor a grade, which may lead to a decrease in their motivation to continue the course. Nevertheless, not many researchers have focused on student participation in peer review at large scale [3]. More research is needed to develop practical solutions for effective peer-review activities at large scale. One research line could involve the prediction of students' participation in peer reviews. An accurate estimation of peer-review participation can be utilized in various practical ways. For example, instructors can use this information to tune peer-review activities (e.g., incorporating an adaptive time schedule for completing peer reviews based on students' expected level of participation). This information can also be used to inform the design of other collaborative activities (e.g., forming groups that are inter-homogeneous in terms of students' desire to review teammates' work).

The work presented in [5] was our first attempt to predict the number of peer works a student will review by using regression methods with a large feature set. The results were promising, with a reasonably low error that decreases as the course progresses and more data reflecting student behaviour becomes available. However, the model was built with a large feature set, which may result in overfitting in MOOC contexts with fairly few students participating in peer reviews.
Further, a large part of the error was accumulated on those students who submitted their assignment but did not review any peer submission. This paper addresses these limitations by building a new feature set with fewer yet more informative variables, and by proposing an ensemble learning model.

In the following section, we describe the course data at hand and provide the details of our feature-generation approach. Next, we present the experimental study by describing the feature-selection approach and the details of the ensemble method. Then, the prediction performance of each prediction model employed is shared. We conclude by discussing follow-up research ideas.

2 Previous Findings

In our previous work [5], we obtained promising results by using regression methods to predict student participation in peer reviews in a MOOC (with 3620 enrollments) published by Canvas Network¹. The feature set contained more than 80 items, including weekly cumulative features (e.g., total number of discussion activities during the whole week) as well as daily features (e.g., number of content visits on each day before the peer-review activity). There were four assignments involving the submission of a learning artefact, and they were evaluated using peer reviews. Figure 1 provides the histograms along with descriptive statistics regarding the number of peer works reviewed by each student. The recommended (or required) number of peer reviews appears to be three, as most students performed three peer reviews at each session.

[Fig. 1. Peer-review participation per session, with mean and standard deviation scores: 1st µ=2.62, SD=1.42; 2nd µ=2.56, SD=1.24; 3rd µ=2.41, SD=1.67; 4th µ=2.46, SD=1.35.]

¹ https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/XB2TLU (the id of the course is 770000832960949).

The prediction models included one of three regression methods (LASSO, the least absolute shrinkage and selection operator; ridge; and elastic net), and the performance of each method was tested. Table 1 shows the prediction performance of the LASSO regression (which was chosen as it was the best-performing method). The total mean absolute error (MAE) scores were reasonably low in general, and the performance improved considerably with the inclusion of past peer-review activities starting from the 2nd peer-review session. However, the prediction of the participation of students with no actual peer-review participation was inaccurate. This finding has a non-negligible impact on the overall error (note that around 1/6 of the students who submitted their assignment did not review any of their peers), suggesting a need for reducing the error resulting from the disengaged students to improve the overall prediction performance. Furthermore, we found that many features were redundant, particularly those derived from student activities on a specific day (e.g., quiz activity 2 days before the peer reviews). Therefore, the predictive model obtained was complex, with many features that were particular to the context, limiting the transferability of the model to other MOOCs. Another possible problem could be overfitting, as this complex model was trained and tested on a small sample. The current study addresses the limitations of the previous work by studying the feature space more deeply and proposing an ensemble learning approach, as described in the following sections.
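For concreteness, the following is a minimal sketch of the kind of regression baseline summarized above: a LASSO model predicting the number of peer works a student will review, evaluated with cross-validated MAE. The data file, DataFrame, and column names are hypothetical placeholders, not the actual dataset used in [5].

```python
# Minimal sketch of a LASSO baseline predicting peer-review participation,
# evaluated with cross-validated MAE. Data loading and column names are
# hypothetical placeholders.
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_predict

features = pd.read_csv("peer_review_features.csv")   # hypothetical feature file
X = features.drop(columns=["n_peer_reviews"]).values
y = features["n_peer_reviews"].values                 # actual number of peer works reviewed (0-4)

# LASSO performs internal feature selection through L1 regularization.
lasso = Lasso(alpha=0.1)                              # illustrative regularization strength
y_pred = cross_val_predict(lasso, X, y, cv=10)

print(f"Overall MAE: {np.mean(np.abs(y - y_pred)):.2f}")
```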
Table 1. MAE scores per actual value of the peer-review participation.

Actual participation    0     1     2     3     4    TOTAL
Peer reviews 1        2.24  1.25  0.39  0.64  1.60   1.02
Peer reviews 2        1.59  0.82  0.67  0.40  1.08   0.66
Peer reviews 3        1.18  0.90  0.77  0.32  0.93   0.56
Peer reviews 4        1.12  1.04  0.71  0.31  0.97   0.58

3 Improvements

3.1 Feature Generation

Given the limitations of the features used previously, we have revised them to obtain a reduced yet predictive set that can be transferred across peer-review sessions within the same course and that can also apply to other MOOC contexts. For this purpose, we mainly adopted the features proposed in [6], which are based on edX MOOCs. Given that Canvas Network MOOCs have a different database structure than edX MOOCs, we either adopted similar features or extracted the same ones when possible. The effectiveness of such features in predicting student engagement in MOOCs has been shown [7]. These features could be effective in predicting students' peer-review participation, as their overall course engagement is likely to be associated with their peer-review engagement [8]. Each feature was computed using the data between consecutive peer-review sessions (e.g., features for the 3rd peer reviews were calculated using the data obtained after the 2nd peer reviews), since students' recent activities could be more relevant to their subsequent peer-review participation.

Furthermore, features about learners' activity sequences (e.g., taking a quiz followed by reading) can be powerful predictors of engagement in MOOC contexts [10]. The sequence features capture the order of student activities and can help to identify different student profiles. Sequence features can easily scale up to thousands, as activities can follow many different orders [10]. To obtain a small yet relevant set, we decided to focus on assignment, discussion and content activities and generated features for sequences of two activities (see the sketch after Table 2). The complete list of features generated (n=41) is provided in Table 2.

Table 2. Features extracted for the prediction of participation in peer reviews.

{a}_count           Number of a-type requests.
days_with_{a}       Number of days with at least one a-type request.
avgt_btw_{a} (1)    Average time in minutes between a-type requests.
{a}_within1h (1)    Number of a-type requests within a one-hour interval.
uncomp_qs           Number of incomplete quiz submissions.
comp_qs             Number of successful quiz submissions.
ttl_quizattempts    Total number of quiz attempts.
avg_quizattempts    Average number of quiz attempts.
ttl_quiz_time       Total time spent in quizzes (in minutes).
avg_quiz_time       Average time spent in quizzes (in minutes).
avg_qs_score        Average quiz score.
de_count            Total number of discussion entries.
de_msg_cc           Average character length of the discussion entries posted.
days_with_de        Number of days with at least one discussion entry.
assign_score        Past assignment score.
pr_subms_count      Number of student submissions reviewed.
pr_count (2)        Number of past peer reviews performed.
reviews_received    Number of reviews received for the student's previous assignment.
da_count (3)        Number of discussion-assignment activity sequences.
qa_count (3)        Number of quiz-assignment activity sequences.
ca_count (3)        Number of content-assignment activity sequences.
ad_count (3)        Number of assignment-discussion activity sequences.
qd_count (3)        Number of quiz-discussion activity sequences.
cd_count (3)        Number of content-discussion activity sequences.
ac_count (3)        Number of assignment-content activity sequences.
qc_count (3)        Number of quiz-content activity sequences.
dc_count (3)        Number of discussion-content activity sequences.

Notes: a denotes the type of the request (content, quiz, assignment, or discussion); (1) also calculated combining all request types; (2) differs from pr_subms_count if students reviewed the same submission multiple times; (3) divided by the total number of requests.
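As an illustration of how the 2-activity sequence features can be derived, the following is a minimal sketch assuming the clickstream is available as a pandas DataFrame with student_id, timestamp and activity_type columns (hypothetical names); it counts consecutive activity pairs per student and divides them by the student's total number of requests, as in note (3) of Table 2.

```python
# Minimal sketch of 2-activity sequence feature extraction. The DataFrame `log`
# and its column names are hypothetical placeholders.
from collections import Counter
import pandas as pd

def sequence_features(log: pd.DataFrame) -> pd.DataFrame:
    """Count consecutive activity pairs per student (e.g., discussion followed by
    assignment -> da_count), normalized by the student's total number of requests."""
    rows = []
    for student, events in log.sort_values("timestamp").groupby("student_id"):
        acts = events["activity_type"].tolist()   # e.g., ["content", "quiz", "assignment", ...]
        pairs = Counter(zip(acts, acts[1:]))       # consecutive 2-activity sequences
        row = {"student_id": student}
        for (first, second), count in pairs.items():
            row[f"{first[0]}{second[0]}_count"] = count / len(acts)
        rows.append(row)
    return pd.DataFrame(rows).fillna(0)            # students lacking a given pair get 0
```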
3.2 Ensemble Learning Method

Ensemble learning is a machine-learning technique that combines multiple learning algorithms to achieve higher predictive performance than could be achieved with a single algorithm. Ensemble methods have been found to improve predictive models in the MOOC literature [11]. The motivation for using an ensemble learning method for the current prediction task emerged from our previous work, in which we found that the overall prediction performance suffers largely from poorly predicting the participation of students who have zero actual peer-review participation. Identifying such students beforehand using classification methods (i.e., non-participants vs. participants) and running the regression models only for participants in peer reviews might potentially lead to higher accuracy. Therefore, to improve the prediction accuracy, we propose a sequential ensemble approach [12], in which a classification step is integrated prior to regression to identify those with no peer reviews ahead of time and exclude them from the regression analysis. Later, those classified as having no participation were combined with the regression predictions to evaluate the overall performance. Figure 2 depicts the proposed ensemble method.

[Fig. 2. The components of the ensemble method: the whole dataset feeds a classification step; students predicted to have peer reviews go to the regression step, students predicted to have zero peer reviews are set aside, and both outputs are combined into the peer-review participation predictions and the overall prediction performance.]

4 Experimental Study

4.1 Method

First, we replicated our previous study with the revised feature set. Two regression methods were tested. The first one is LASSO, which has an internal feature-selection mechanism based on L1 regularization. LASSO has been effective in previous MOOC research [13]. However, LASSO may have performance issues when features are correlated [14], which might be the case in the current study as some features were extracted from similar data. Therefore, we also used correlation-based feature selection (CFS) [15] to train a linear regression (LR) model. CFS focuses on the predictive ability of each feature while maintaining a low correlation among the selected features to minimize redundancy.

In the ensemble learning model, logistic regression (LGR) was chosen as the classifier as it was found to be more accurate than the other classifiers that were pilot-tested (e.g., stochastic gradient descent and decision trees). L1 regularization and CFS were also used to perform feature selection for the classification model. While the whole dataset was used to train the classification model, only the data about students with at least one peer review was used to train the regression model. Only students who submitted the corresponding assignment were included in the predictions, since only those students could review others' submissions. Beginning with the 2nd assignment, features capturing the previous assignment score and past peer-review participation were included in the predictions. Since the sample size was small, 10-fold cross-validation was used, and the performance was evaluated using MAE [16]. MAE was used as the metric since it provides a plain interpretation of performance when the target variable has a narrow range (i.e., 0-4). Also, please note that prediction scores were rounded to the closest integer value (as decimal numbers would not be practical in a real course). We used the scikit-learn implementations of LASSO, LGR, and LR, and the WEKA implementation of CFS.
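To make the sequential ensemble concrete, the following is a minimal sketch under stated assumptions: a feature matrix X and the target y (number of peer works reviewed, 0-4) are already prepared for students who submitted the assignment, logistic regression is used as the classifier and LASSO as the regressor as described above, and a hold-out split stands in for the 10-fold cross-validation. Variable names and hyperparameters are illustrative, not the exact configuration used in the study.

```python
# Minimal sketch of the sequential ensemble: a classifier first flags students
# expected to perform zero peer reviews; a regressor trained only on participating
# students predicts the count for the remaining ones.
import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

def fit_ensemble(X, y):
    """Train the classification and regression components on numpy arrays."""
    clf = LogisticRegression(penalty="l1", solver="liblinear")  # L1 also performs feature selection
    clf.fit(X, (y > 0).astype(int))                             # participants vs. non-participants
    reg = Lasso(alpha=0.1)                                      # illustrative regularization strength
    reg.fit(X[y > 0], y[y > 0])                                 # regression on participants only
    return clf, reg

def predict_ensemble(clf, reg, X):
    """Zero for predicted non-participants; rounded, clipped regression output otherwise."""
    y_pred = np.zeros(len(X))
    participants = clf.predict(X) == 1
    y_pred[participants] = np.clip(np.rint(reg.predict(X[participants])), 0, 4)
    return y_pred

# Illustrative usage (the study evaluated with 10-fold cross-validation instead):
# clf, reg = fit_ensemble(X_train, y_train)
# mae = np.mean(np.abs(y_test - predict_ensemble(clf, reg, X_test)))
```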
4.2 Results and Discussion

The MAE scores at each actual participation level (0 to 4), as well as the total MAE scores of each prediction model, are provided in Table 3 and Table 4. When compared to the previous results (see Table 1), the performance of the regression model (see Table 3) remained almost the same with the refined list of features, with a similar trend of increasing accuracy at each subsequent prediction. The error rates were highest at the 0-participation level. Given the likelihood of overfitting with complex models, we favour the use of the refined feature set to minimize this possibility. Also, the current feature set has the capacity to be transferred to any other week involving a peer-review prediction, as well as to other MOOCs.

Table 3. MAE scores per actual value at each peer-review session when L1 regularization is used for feature selection.

Actual participation              0     1     2     3     4    TOTAL
1st Peer Reviews   Regression   2.06  1.08  0.24  0.75  1.68   1.04 (Std. = 0.40)
                   Ensemble     2.06  1.08  0.24  0.76  1.68   1.04 (Std. = 0.41)
2nd Peer Reviews   Regression   1.73  0.71  0.76  0.23  1.31   0.60 (Std. = 0.75)
                   Ensemble     1.59  0.83  0.79  0.24  1.30   0.59 (Std. = 0.84)
3rd Peer Reviews   Regression   1.19  0.78  0.82  0.20  0.88   0.49 (Std. = 0.94)
                   Ensemble     0.74  1.08  1.05  0.20  1.06   0.45 (Std. = 1.12)
4th Peer Reviews   Regression   1.06  1.03  0.73  0.21  0.98   0.52 (Std. = 0.99)
                   Ensemble     0.73  1.28  0.97  0.23  0.98   0.50 (Std. = 1.16)

Table 4. MAE scores per actual value at each peer-review session when CFS is used for feature selection.

Actual participation              0     1     2     3     4    TOTAL
1st Peer Reviews   Regression   2.05  1.13  0.33  0.70  1.63   1.01 (Std. = 0.46)
                   Ensemble     2.05  1.13  0.33  0.70  1.63   1.01 (Std. = 0.43)
2nd Peer Reviews   Regression   1.68  0.66  0.71  0.28  1.17   0.60 (Std. = 0.79)
                   Ensemble     1.45  0.97  0.74  0.25  1.32   0.58 (Std. = 0.92)
3rd Peer Reviews   Regression   1.10  0.78  0.85  0.24  0.91   0.50 (Std. = 0.99)
                   Ensemble     0.75  1.17  0.89  0.22  1.00   0.45 (Std. = 1.14)
4th Peer Reviews   Regression   0.96  1.03  0.73  0.22  0.98   0.51 (Std. = 1.03)
                   Ensemble     0.73  1.28  0.93  0.23  0.98   0.50 (Std. = 1.16)

According to the results of the ensemble model in Table 3, the prediction performance slightly increased (except at the 1st peer reviews) when a classification phase was incorporated before running the regression model, compared to the performance of regression alone. That is, the classification model helped reduce the error introduced by students with zero peer-review participation. However, at the same time, the error increased in the prediction of other levels of participation due to poor classification performance. Also, no improvement was noted for the predictions at the 1st peer-review session, probably because students who do and who do not contribute to peer reviews have very similar profiles at this stage of the course based on the current feature set. Further, the feature-selection methods did not appear to have different effects on the prediction performance.

The results showed that the proposed ensemble method produced better predictions than those obtained using the regression method alone. This was because students with no peer-review participation were undermining the performance of the regression model, which was addressed by incorporating a classification phase to identify and exclude those with no participation when training the regression model. However, the overall performance did not improve considerably, as the students with no peer-review participation were not classified perfectly, yielding a mediocre performance at certain levels of participation. Nonetheless, given that the actual peer-review participation has means between 2.41 and 2.62 and standard deviations between 1.24 and 1.67 across sessions (Fig. 1), the MAE scores achieved with the ensemble method, ranging from 0.45 to 1.04, seem promising. Thus, the proposed predictive model holds potential to be utilized in a real MOOC context.
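The per-level breakdown reported in Tables 1, 3 and 4 can be computed with a short helper; this is a minimal sketch assuming arrays of actual and (already rounded) predicted peer-review counts, with hypothetical variable names.

```python
# Minimal sketch of the per-level error breakdown used in Tables 1, 3 and 4:
# MAE grouped by the actual number of peer reviews (0-4), plus the total MAE.
import numpy as np
import pandas as pd

def mae_by_actual(y_true, y_pred):
    errors = pd.DataFrame({
        "actual": np.asarray(y_true),
        "abs_error": np.abs(np.asarray(y_true) - np.asarray(y_pred)),
    })
    per_level = errors.groupby("actual")["abs_error"].mean()   # one MAE per actual value 0-4
    return per_level, errors["abs_error"].mean()                # and the total MAE

# Illustrative usage: per_level, total = mae_by_actual(y_actual, y_predicted)
```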
5 Conclusion and Future Work

In this study, building on our previous work, we proposed a sequential ensemble learning method with a refined set of features to obtain an accurate prediction of students' peer-review participation. The results showed that the proposed ensemble model holds potential to be further explored in future research. First, the classification model needs further attention. The reasons for its moderate performance need to be explored and addressed accordingly, using different classification approaches and more relevant features. For example, a nested ensemble approach could be utilized. Second, the ensemble method failed to improve the prediction performance at the 1st peer reviews. Possibly, the student profiles identified with the current feature set were not distinctive early in the course and therefore offered no benefit for the classification. More distinctive features need to be identified to improve the classification performance. Nonetheless, the challenge of identifying students who will not participate in peer reviews early in the semester constitutes an interesting research opportunity. Moreover, although the approach used in this study demonstrates the validity of the prediction model, it is not applicable to an ongoing MOOC, as the values of the target variable (the number of peer works reviewed) would be needed to train the models. Therefore, other relevant training paradigms (e.g., in-situ learning) should be used to build accurate yet practical models that can be useful in continuing MOOCs [17].

6 Acknowledgements

Access to the data used in this paper was granted by Canvas Network. This work has been partially funded by research projects TIN2014-53199-C3-2-R and VA082U16, and by the Spanish network of excellence SNOLA (TIN2015-71669-REDT).

References

1. Topping, K.: Peer assessment between students in colleges and universities. Rev. Educ. Res. 68, 249–276 (1998).
2. Piech, C., Huang, J., Chen, Z., Do, C., Ng, A., Koller, D.: Tuned models of peer assessment in MOOCs. In: International Conference on Educational Data Mining, pp. 153–160 (2013).
3. Estevez-Ayres, I., Crespo-García, R.M., Fisteus, J.A., Delgado-Kloos, C.: An algorithm for peer review matching in massive courses for minimising students' frustration. J. Univers. Comput. Sci. 19, 2173–2197 (2013).
4. Suen, H.: Peer assessment for massive open online courses (MOOCs). Int. Rev. Res. Open Distrib. Learn. 15 (2014).
5. Er, E., Bote-Lorenzo, M.L., Gómez-Sánchez, E., Dimitriadis, Y., Asensio-Pérez, J.I.: Predicting student participation in peer reviews in MOOCs. In: Proceedings of the Second European MOOCs Stakeholder Summit 2017, Madrid (2017).
6. Veeramachaneni, K., O'Reilly, U.-M., Taylor, C.: Towards feature engineering at scale for data from massive open online courses. arXiv:1407.5238v1 (2014).
7. Jayaprasad, S.: Transfer learning for predictive models in massive open online courses. Artif. Intell. 1–12 (2015).
8. Tseng, S.-F., Tsao, Y.-W., Yu, L.-C., Chan, C.-L., Lai, K.R.: Who will pass? Analyzing learner behaviors in MOOCs. Res. Pract. Technol. Enhanc. Learn. 11, 1–11 (2016).
9. Crossley, S., Paquette, L., Dascalu, M., McNamara, D.S., Baker, R.S.: Combining clickstream data with NLP tools to better understand MOOC completion. Proc. Sixth Int. Conf. Learn. Anal. Knowl. (LAK '16), 6–14 (2016).
10. Brooks, C., Thompson, C., Teasley, S.: A time series interaction analysis method for building predictive models of learners using log data. Proc. Fifth Int. Conf. Learn. Anal. Knowl. (LAK '15), 126–135 (2015).
11. Boyer, S., Veeramachaneni, K.: Robust predictive models on MOOCs: transferring knowledge across courses. Proc. 9th Int. Conf. Educ. Data Min., 298–305 (2016).
12. Zhou, Z.-H.: Ensemble learning. In: Li, S.Z. (ed.) Encyclopedia of Biometrics, pp. 270–273 (2009).
13. Robinson, C., Yeomans, M., Reich, J., Hulleman, C., Gehlbach, H.: Forecasting student achievement in MOOCs with natural language processing. Proc. Sixth Int. Conf. Learn. Anal. Knowl. (LAK '16), 383–387 (2016).
14. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 67, 301–320 (2005).
15. Hall, M.: Correlation-based feature selection for discrete and numeric class machine learning. In: Proceedings of the Seventeenth International Conference on Machine Learning, pp. 359–366 (2000).
16. Sawyer, R.: Sample size and the accuracy of predictions made from multiple regression equations. J. Educ. Stat. 7, 91–104 (1982).
17. Bote-Lorenzo, M.L., Gómez-Sánchez, E.: Predicting the decrease of engagement indicators in a MOOC. In: Seventh International Conference on Learning Analytics and Knowledge, pp. 143–147 (2017).