Personalized Impact on Students in MOOCs

Lubov Lisitsyna 1, Svyatoslav Oreshin 2

1 ITMO University, Kronverkskiy pr. 49, Saint Petersburg, 197101, Russia
lisizina@mail.ifmo.ru
2 ITMO University, Kronverkskiy pr. 49, Saint Petersburg, 197101, Russia
Aqice26@gmail.com

Abstract. Churn prediction is a common task for machine learning applications in business. In this paper, we adapt this task to address the low efficiency of massive open online courses (MOOCs), which manifests itself as a very low share of students who successfully finish a course. The presented approach is described and tested on the course "Methods and algorithms of graph theory" held on the national platform of online education in Russia. The paper covers all the steps needed to build an intelligent system that predicts which students are active during the course but unlikely to finish it. The first part deals with constructing the right sample for prediction, exploratory data analysis, and choosing the most appropriate week of the course to make predictions at. The second part deals with choosing the right metric and building models; an ensembling approach based on stacking is also proposed to increase the accuracy of predictions. Finally, we review the outcome of applying this approach to real students and discuss the results and further improvements. Our personalized impact showed that the majority of students (70%) perceive such an impact positively and that it helps them to pass the hardest tasks of the considered online course.

Keywords: Machine learning · Data science · Massive Open Online Course · Educational analytics · Learning analytics

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The main problem of using Massive Open Online Courses (MOOCs) is their low performance (no more than 5%), estimated as the proportion of students who successfully complete a course to the total number of students registered at its start. The analysis of low MOOC performance [1] revealed a number of reasons related to the poor readiness of listeners for e-learning and their low motivation to achieve higher learning outcomes. Different proposals [1-6] have been published and reviewed, including monitoring the situational awareness of a student working with electronic forms before learning.

In this paper, we adapt the churn prediction task to predict students' churn in MOOCs. The classical churn prediction task consists in building a model that finds the clients who are likely to break their contract. This task has also been applied to predict students' churn in classical higher education [7]. When adapting it to MOOCs, the formulation changes. The main difference lies in the information available about a student (client): in MOOCs, we have very limited data about a particular student, so we need to collect enough data during the student's learning process to make predictions. We may also use meta-information about the student's previous performance on other courses and, if available, the open information in his or her profile.

Firstly, we need to select the correct time point in the course, so that the data on all the students' activity in the selected course before this point can be used; there may be several such points. Secondly, we need to choose the correct prediction target. We propose to use a binary target indicating whether a student successfully passes the final exam of the course. This target may differ depending on the structure and difficulty of a particular course.
Further, we overview the whole process of solving the selected problem with a machine learning approach and demonstrate its effectiveness on the online course "Methods and algorithms of graph theory" by ITMO University. Finally, we discuss the experimental results and the further goals of applying the proposed approach to real students. This article proposes a user-based approach to sampling the statistical data recorded by the e-learning system during a course in order to predict the course's performance. The approach aims at increasing personalized monitoring of the e-learning process and adapting the platform to a particular student.

2 Data mining and exploratory data analysis

This section presents the process of collecting data from the activity logs of the platform, aggregating this data per student, and choosing the correct time point in the course to build predictions at. The process of data mining in MOOCs strongly depends on the structure of a course, which is why we start by analyzing the material of a particular course to find the best approach and strategy for making predictions.

2.1 Course material overview

This research uses statistical data accumulated on the national open education platform of the Russian Federation during the online course "Methods and algorithms of graph theory" (https://openedu.ru/course/ITMOUniversity/AGRAPH/) for the period from 2016 to 2019. The experiment of personalized impact on students took place in the spring session of 2019. This online course [8, 9] runs for 10 weeks twice a year (at the beginning of the fall and spring semesters) and contains 41 video lectures with surveys and 11 interactive practical exercises. An online exam is held on the 10th week. Table 1 lists the practical exercises of the course.

Table 1. Practical exercises of the course

| Algorithm | Typical graph problem | Week number |
|---|---|---|
| Lee algorithm | Search for the shortest route | 2 |
| Bellman-Ford algorithm | Search for the route with minimal weight | 2 |
| Roberts-Flores algorithm | Search for Hamiltonian cycles | 3 |
| Prim algorithm | Search for a minimum spanning tree | 4 |
| Kruskal algorithm | Search for a minimum spanning tree | 4 |
| Magu-Weismann algorithm | Search for the largest empty subgraphs | 5 |
| Method based on the Magu-Weismann algorithm | Minimum vertex coloring of a graph | 6 |
| Greedy heuristic algorithm | Minimum vertex coloring of a graph | 6 |
| Hungarian algorithm | Search for a perfect matching in a bipartite graph | 7 |
| Algorithm based on the ISD method | Detecting isomorphism of two graphs | 8 |
| Gamma-algorithm | Graph planarization | 9 |

As we can see, the course has many practical exercises and lasts 10 weeks of intensive studying. We need to choose the correct time point to make predictions at. By that time, we should have enough information about a student's performance, while still having enough time to impact the student in a way that increases his or her motivation and, thus, the chances to successfully finish the course. In other words, selecting the correct time point involves a trade-off between the timeliness of the impact and having enough information about the student's performance.
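Whatever time point is chosen, the raw platform logs first have to be turned into one feature row per student. The exact procedure depends on the platform's export format; the following is only a minimal pandas sketch of such an aggregation, in which the file name and all column names (user_id, week, event_type, attempts, grade) are assumptions rather than the actual schema of the national platform.

```python
import pandas as pd

# Assumed raw event log: one row per logged event.
# Hypothetical columns: user_id, week, event_type, attempts, grade.
events = pd.read_csv("course_events.csv")

# Keep only events observed before the prediction point (end of week 5).
PREDICTION_WEEK = 5
early = events[events["week"] <= PREDICTION_WEEK]

# Per-student, per-week activity counts, pivoted into one row per student.
weekly_activity = (
    early.pivot_table(index="user_id", columns="week",
                      values="event_type", aggfunc="count", fill_value=0)
    .add_prefix("activity_week_")
)

# Aggregate statistics over interactive practical exercises.
tasks = early[early["event_type"] == "interactive_task"]
task_stats = tasks.groupby("user_id").agg(
    mean_attempts=("attempts", "mean"),
    total_attempts=("attempts", "sum"),
    mean_grade=("grade", "mean"),
)

# One row per student: the feature matrix used for prediction.
features = weekly_activity.join(task_stats, how="left").fillna(0)
```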
To select the most appropriate time point to build predictions at, we analyzed the practical exercises in the middle of the course. Figure 1 presents the distribution of the average time a student needs to complete each practical exercise.

Figure 1. Average time (in minutes) taken to complete practical exercises

As we can see, the Magu-Weismann algorithm is studied on the 5th week (the middle point of the course), and its completion time has a bimodal distribution, which can be an indicator that this task is complicated for a noticeable share of students. We can also conclude that the average time of solving a practical task is about 20 minutes.

The course lasts 10 weeks, so it is reasonable to take the 5th week, its middle point, as the time to make predictions at. Moreover, the most complicated task is held on the 5th week, and completing this week strongly determines completing the overall course. As a conclusion, we take the 5th week as the time point at which we make predictions about students' performance on the final exam. As mentioned above, this time point may differ for a particular online course; when choosing it, we recommend estimating the amount of information that can be gained about a student by that point, the difficulty of the problems before and after it, and the completeness of the material already learned.

3 Task formalization and model fitting

In this section, we formulate the churn prediction problem in a MOOC in machine learning terms and build a model for this binary classification task. After the models are tested and compared, we propose an ensembling approach to increase the overall performance.

3.1 Problem overview

We have a binary classification problem whose target indicates successful passing of the final exam of the course. Having the probability of successful passing, we rank all the students by these probabilities. Then, we need to identify the students to whom an additional impact should be applied. This group should consist of students who are active but have a low chance of successfully finishing the course; thus, their predicted probability should be neither very low nor very high. We suggest estimating the expected number of students who will pass the exam and using it as the upper bound when selecting this group. We can then take a varying number of students depending on the resources available for the additional impact: if the process of applying the impact is automated, a large group may be taken. In our experiment, the personalized impact consisted of sending emails with an analysis of a particular student's problems and personalized advice depending on his or her case, so this type of impact has a high cost in terms of the human time needed to analyze each case. Several types of impact can also be combined, with different numbers of students and different thresholds on the likelihood of passing the final exam.

To evaluate classifiers for this task, the ROC AUC metric [10] was chosen, since it operates on the probabilities of an object belonging to a class over different decision thresholds. ROC AUC also indicates the quality of ranking, which is the most important subtask for correctly choosing the group of students to impact. The problem is thus formulated as probabilistic binary classification with subsequent ranking by likelihood.

3.2 Classifiers' fitting and analysis

To build a baseline for this classification problem, a support vector machine [11], logistic regression [12, 13], random forest [14], and gradient boosting on decision trees (GBDT) [15, 16] were chosen and validated. To evaluate the different classifiers, nested cross-validation was used. Table 2 presents the ROC AUC values obtained on cross-validation for these models; as we can see, GBDT has the best value of the chosen metric. For the GBDT algorithm, we used the XGBoost and CatBoost implementations, which gave comparable ROC AUC results.

Table 2. Results of cross-validation for baseline models

| Model | ROC AUC |
|---|---|
| Logistic regression | 0.8699 |
| Support vector machine (RBF) | 0.8763 |
| Random forest | 0.9027 |
| Gradient boosting on trees | 0.9153 |
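For illustration, a simplified sketch of such a baseline comparison is given below. It assumes the feature matrix from the earlier sketch plus a binary label vector passed_exam (1 if the student passed the final exam), uses a plain stratified 5-fold split instead of the nested cross-validation applied in the paper, and substitutes scikit-learn's GradientBoostingClassifier for the XGBoost/CatBoost models; all hyperparameters are illustrative only.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: per-student features built from weeks 1-5; y: passed the final exam (0/1).
X, y = features.values, passed_exam.values

models = {
    "logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "SVM (RBF)": make_pipeline(StandardScaler(),
                               SVC(kernel="rbf", probability=True)),
    "random forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: ROC AUC = {scores.mean():.4f} +/- {scores.std():.4f}")
```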
We hypothesize, however, that the baseline can be improved by stacking [17, 18]. Figures 2 and 3 present the similarity plots between the logistic regression and GBDT classifiers on different folds of cross-validation. From the plots, we can conclude that the predictions of these classifiers differ considerably in individual cases, even though both classifiers have a high ROC AUC score. For stacking, we therefore chose one linear model and one tree-based model. We chose logistic regression as the linear model because an SVM with a radial basis function (RBF) kernel [19] produces nearly constant probability estimates, which is not appropriate for ROC AUC and stacking without additional probability calibration, and such calibration can behave poorly in the general case of ensembling models. A support vector machine with a linear kernel cannot operate with probabilities the way other linear models do, because it does not apply any mapping into probability space. The GBDT model was chosen as the tree-based model for further improvement because of its highest ROC AUC value.

Figure 2. The similarity plot between GBDT and logistic regression on the 4th fold of cross-validation

Figure 3. The similarity plot between GBDT and logistic regression on the 5th fold of cross-validation

Stacking was applied using another logistic regression model that builds new predictions on top of the predictions of the initial models. We chose the 3rd session of the course as the validation set for the meta-classifier of the ensemble. The results of cross-validation for logistic regression, GBDT, and the meta-model of this ensemble are presented in Table 3. We conclude that stacking improved the results of each base classifier, and we use the stacked ensemble as the final model for this problem.

Table 3. Results of cross-validation for logistic regression, GBDT, and the ensemble of these models

| Split | Model | ROC AUC score |
|---|---|---|
| 1 | Logistic regression | 0.9255 |
| 1 | Gradient boosting on decision trees | 0.9546 |
| 1 | Stacking | 0.9767 |
| 2 | Logistic regression | 0.8702 |
| 2 | Gradient boosting on decision trees | 0.9302 |
| 2 | Stacking | 0.9688 |
| 3 | Logistic regression | 0.9116 |
| 3 | Gradient boosting on decision trees | 0.9780 |
| 3 | Stacking | 0.9742 |
| 4 | Logistic regression | 0.7659 |
| 4 | Gradient boosting on decision trees | 0.9117 |
| 4 | Stacking | 0.8876 |
| 5 | Logistic regression | 0.8651 |
| 5 | Gradient boosting on decision trees | 0.8925 |
| 5 | Stacking | 0.9160 |
| Mean | Logistic regression | 0.8612 ± 0.0531 |
| Mean | Gradient boosting on decision trees | 0.9189 ± 0.0427 |
| Mean | Stacking | 0.9304 ± 0.0459 |
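A minimal sketch of this kind of stacked ensemble is shown below. Note that the paper fits the meta-classifier on a held-out course session (the 3rd one), whereas this sketch approximates that with scikit-learn's StackingClassifier and its internal cross-validation; X and y are the assumed feature matrix and labels from the earlier sketches.

```python
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Base models: one linear and one tree-based classifier.
base_models = [
    ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("gbdt", GradientBoostingClassifier(random_state=0)),
]

# Meta-model: another logistic regression fitted on the base models'
# out-of-fold predicted probabilities (stack_method="predict_proba").
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",
    cv=5,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(stack, X, y, cv=cv, scoring="roc_auc")
print(f"stacking: ROC AUC = {scores.mean():.4f} +/- {scores.std():.4f}")
```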
After fitting the final model, we analyzed the feature importance of each model in the ensemble and calculated the feature importance of the final model. The feature importances for logistic regression and gradient boosting are presented in Table 4 and Table 5, respectively. We took the 5 most important features of each base model and computed the scalar product of their importances with the coefficients of the meta-model in the stacking; as a result, we obtained the feature importances of the final model, presented in Table 6. The most important features for logistic regression are the overall mean number of attempts at interactive tasks and the number of attempts at the 4th interactive task. The most important features for GBDT are the overall activity of a student during the course and the activity on the 2nd week.

We can conclude that the logistic regression classifier mainly pays attention to features describing attempts at interactive tasks, while the GBDT classifier pays more attention to overall activity statistics. All these features were calculated before the 5th week of the course. The most important features of the final model are the composition of the feature importances from Table 4 and Table 5 with the meta-model's coefficients. As we can conclude, the most important feature of the final model is the activity of a student on the 2nd week of the course; features such as the overall number of attempts on the course and the overall activity are also important.

Table 4. Top 5 most important features for logistic regression

| Feature | Importance |
|---|---|
| Mean number of attempts at interactive tasks | 20.16% |
| Number of attempts at the 4th interactive task | 11.94% |
| Mean grade score | 11.89% |
| Number of attempts at the 1st interactive task | 6.32% |
| Number of attempts at the 6th interactive task | 4.61% |

Table 5. Top 5 most important features for gradient boosting

| Feature | Importance |
|---|---|
| Overall activity | 18.38% |
| Activity on the 2nd week | 10.43% |
| Number of attempts at the 2nd interactive task | 5.46% |
| Number of attempts at the 6th interactive task | 5.28% |
| Mean number of attempts at interactive tasks | 5.17% |

Table 6. Top 5 most important features for the final model

| Feature | Importance |
|---|---|
| Activity on the 2nd week | 8.25% |
| Number of attempts at interactive tasks | 7.87% |
| Overall activity | 7.64% |
| Activity on the 1st week | 5.89% |
| Mean number of attempts at interactive tasks | 5.72% |

In this section, we formulated the problem in machine learning terms, fitted and compared different classifiers, and analyzed their feature importance. Finally, we applied an ensembling strategy to increase the overall performance of the logistic regression and gradient boosting classifiers and analyzed the feature importance of the resulting model. Now, we can use the fitted model to predict the boundary group of students and make a personalized impact on it. In the next section, we demonstrate the results of our experiment of making a personalized impact on students in the spring 2019 session of the course.

4 Results of the experiment

To select the threshold, we take the percentage of students who passed the exam in previous sessions (5.6%) multiplied by the number of students in the current session. Table 7 presents the results of ranking the students of the test set by their likelihood of completing the course, starting with the highest probability. After applying the calculated threshold, we obtain a list of students who need an additional impact to increase the effectiveness of their learning (Table 8). The last column of the tables shows whether the participant actually passed the exam (1 for yes, 0 for no). The resulting tables show that, in general, the model ranks the students of the course correctly according to their likelihood of passing the exam: only 2 students below the selected threshold actually passed the final exam. Table 8 was used to form the group of students for the personalized impact; we took 10 students for our experiment and made a personalized impact on this group.
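A minimal sketch of this ranking and thresholding step is shown below. It assumes the stacked model from the previous sketch has been fitted on past sessions, and that X_current and current_user_ids (both hypothetical names) hold the week 1-5 features and identifiers of the students of the current session. Tables 7 and 8 then show the top of the resulting ranking and the boundary group selected below the threshold.

```python
import pandas as pd

# Fit the final (stacked) model on past sessions and score the current one.
stack.fit(X, y)                                # past-session features and labels
proba = stack.predict_proba(X_current)[:, 1]   # current-session features (assumed)

ranking = (
    pd.DataFrame({"user_id": current_user_ids, "p_pass": proba})
    .sort_values("p_pass", ascending=False)
    .reset_index(drop=True)
)

# Expected number of successful students, based on the historical pass rate (~5.6%).
HISTORICAL_PASS_RATE = 0.056
expected_passes = int(round(HISTORICAL_PASS_RATE * len(ranking)))

# Students ranked just below the expected-pass cut-off form the boundary group;
# take as many of them as the resources for personalized impact allow.
GROUP_SIZE = 10
boundary_group = ranking.iloc[expected_passes:expected_passes + GROUP_SIZE]
```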
Table 7. Students with the highest probability of passing the exam

| Student | Probability of passing | Passed exam |
|---|---|---|
| Student 1-1 | 0.8661 | 1 |
| Student 1-2 | 0.8616 | 1 |
| Student 1-3 | 0.8542 | 1 |
| Student 1-4 | 0.8221 | 0 |
| Student 1-5 | 0.8217 | 0 |
| Student 1-6 | 0.8162 | 1 |
| Student 1-7 | 0.8038 | 1 |
| Student 1-8 | 0.7765 | 0 |
| Student 1-9 | 0.7719 | 1 |
| Student 1-10 | 0.7666 | 0 |

Table 8. Students below the threshold

| Student | Probability of passing | Passed exam |
|---|---|---|
| Student 2-1 | 0.4741 | 0 |
| Student 2-2 | 0.4453 | 0 |
| Student 2-3 | 0.4352 | 0 |
| Student 2-4 | 0.4232 | 0 |
| Student 2-5 | 0.4216 | 1 |
| Student 2-6 | 0.4163 | 0 |
| Student 2-7 | 0.4015 | 0 |
| Student 2-8 | 0.3793 | 1 |
| Student 2-9 | 0.3771 | 0 |
| Student 2-10 | 0.3348 | 1 |

After the boundary group of students was revealed, we made a personalized impact on that group. Our impact consisted of sending each student an email with a personalized analysis of his or her performance, together with hints and advice. This kind of impact is useful in this particular course for the following reasons:

1) The student feels personalized treatment from the author of the course, which increases his or her involvement in the learning process.
2) The student can see his or her mistakes in solving practical exercises, which increases the chances of further success in solving the problematic task.
3) The student gets personalized advice about the particular topics that he or she should pay more attention to.
4) The student can give feedback about the course and ask questions about incomprehensible topics.

The methodology of personalized impact may differ from case to case. MOOCs contain many quizzes and practical exercises for which personalized feedback can be given, provided the boundary group of students is identified correctly. It is important that the personalized feedback suits each particular student: in some cases it can decrease a student's motivation because of its apparent simplicity, and in other cases it is not worth spending additional resources on feedback for students who have already given up or never planned to successfully finish the course.

The results of applying the proposed personalized impact are presented in Figure 4: 70% of the students from the target group responded positively to our mailings, 50% accepted our offers of help, and 30% successfully passed the hardest task of the course. These results indicate that students perceive such an impact positively and that it helps them to pass the tasks they have problems with.

Figure 4. The results of the personalized impact on students in the spring 2019 session

5 Conclusion

In this research, we identified the problem of MOOCs' low efficiency and proposed an approach to solving it using machine learning algorithms. We used the online course "Methods and algorithms of graph theory" to show all the steps of building such a solution. We compared different classifiers and proposed an approach to increase the overall quality using stacking. According to the results, the most significant features for assessing whether a student passes the exam were obtained, and the model's predictions yielded a list of participants ordered by their probability of passing the final exam. This approach can be used both to increase the learning efficiency of individual students and to improve the course materials in general; the problem itself can be interpreted as a churn prediction problem. After the final list of students is received, it can be used to make the course more personal for this group of students; for example, we suggest giving a student hints and additional bonuses, or extending deadlines, if he or she continues learning. The results of the final model's analysis can also be used for exploring the aspects of the course that are important for a particular group of students.
Thus, this article proposes a general approach for identifying, during a course, the MOOC students who require an additional impact to improve the performance of e-learning with MOOCs. Using this approach can increase the effectiveness of online courses and make e-learning more self-organized and adaptive for an individual student. Finally, we presented the results of our experiment of applying the personalized impact to the identified boundary group of students: the impact was perceived positively by the students and helped them to solve the hardest practical problem of the course.

References

1. Lisitsyna, L.S., Efimchik, E.A.: Making MOOCs more effective and adaptive on the basis of SAT and game mechanics. Smart Education and e-Learning, 75, 56-66 (2018).
2. Lisitsyna, L.S., Lyamin, A.V., Martynikhin, I.A., Cherepovskaya, E.N.: Situation awareness training in e-learning. Smart Education and Smart e-Learning, 41, 273-285 (2015).
3. Lisitsyna, L.S., Lyamin, A.V., Martynikhin, I.A., Cherepovskaya, E.N.: Cognitive trainings can improve intercommunication with e-learning system. In: 6th IEEE International Conference on Cognitive Infocommunications, pp. 39-44 (2015).
4. Lisitsyna, L.S., Pershin, A.A., Kazakov, M.A.: Game mechanics used for achieving better results of massive online courses. Smart Education and Smart e-Learning, 183-193 (2015).
5. Oreshin, S.A., Lisitsyna, L.S.: Machine learning approach of predicting learning outcomes of MOOCs to increase its performance. Smart Innovation, Systems and Technologies, 144, 107-115 (2019).
6. Oreshin, S.A., Lisitsyna, L.S.: Sampling and analyzing statistical data to predict the performance of MOOC. Smart Innovation, Systems and Technologies, 144, 77-85 (2019).
7. Pradeep, A., Das, S., Kizhekkethottam, J.J., et al.: Students dropout factor prediction using EDM techniques. In: Proc. IEEE Int. Conf. on Soft-Computing and Network Security (ICSNS 2015), pp. 544-547 (2015).
8. Lisitsyna, L.S., Efimchik, E.A.: Designing and application of MOOC "Methods and algorithms of graph theory" on National Platform of Open Education of Russian Federation. Smart Education and e-Learning, 59, 145-154 (2016).
9. Lisitsyna, L.S., Efimchik, E.A.: An approach to development of practical exercises of MOOCs based on standard design forms and technologies. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, 180, 28-35 (2017).
10. Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters, 27, 861-874 (2006).
11. Michie, D., Spiegelhalter, D.J., Taylor, C.C., Campbell, J. (eds.): Machine Learning, Neural and Statistical Classification. Ellis Horwood, Upper Saddle River, NJ, USA (1994). ISBN 0-13-106360-X.
12. Peng, C.-Y.J., Lee, K.L., Ingersoll, G.M.: An introduction to logistic regression analysis and reporting. The Journal of Educational Research, 96, 3-14 (2002).
13. Zekić-Sušac, M., Šarlija, N., Has, A., Bilandžić, A.: Predicting company growth using logistic regression and neural networks. Croatian Operational Research Review, 7, 229-248 (2016).
14. Boulesteix, A.-L., Janitza, S., Kruppa, J., König, I.R.: Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(6), 493-507 (2012).
15. Friedman, J.H.: Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 29(5), 1189-1232 (2001).
16. Friedman, J.H.: Stochastic gradient boosting. Computational Statistics and Data Analysis, 38(4), 367-378 (2002).
17. Smyth, P., Wolpert, D.H.: Linearly combining density estimators via stacking. Machine Learning, 36, 59-83 (1999).
18. Opitz, D., Maclin, R.: Popular ensemble methods: an empirical study. Journal of Artificial Intelligence Research, 11, 169-198 (1999).
19. Chang, Y.-W., Hsieh, C.-J., Chang, K.-W., Ringgaard, M., Lin, C.-J.: Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research, 11, 1471-1490 (2010).