Learning analytics of MOOCs based on natural language processing

Yulia Yu. Dyulicheva, Elizaveta A. Bilashova

V.I. Vernadsky Crimean Federal University, 4 Vernadsky Ave., Simferopol, 295007, Crimea

Abstract
The paper considers the prospects of applying machine learning, in particular decision trees, random forests, and deep learning, to educational data mining problems and to the development of learning analytics tools. We investigate sentiment analysis with the BERT deep model and kMeans clustering with different approaches to text vectorization as building blocks of learning analytics tools, using programming MOOCs from Udemy as an example. We analyze 300 MOOC titles and propose their clustering to better understand the directions of learning and the skills taught. We also analyze 1150 sentences that contain the word “teacher” or its synonyms and 2365 sentences about the course in order to detect students’ sentiments, the top words that describe opinions with positive and negative polarities, and the issues that arise during learning.

Keywords
MOOC, sentiment analysis, BERT deep model, learning analytics

1. Introduction

Recently, MOOCs and various e-learning environments have been generating Big Data related to the activities of students and tutors: clickstreams, sequences of video watching, interactions with different types of learning content, video streams from cameras used during learning to recognize students’ facial expressions, and students’ comments on various social media platforms. The diversity of data in education requires special approaches to data handling and recognition.

Machine learning algorithms are widely used in learning analytics (LA) and educational data mining (EDM) [1]. EDM is considered a methodology for mining regularities from big educational data gathered in educational environments [2]. LA is aimed at developing tools for analyzing and optimizing learning [3, 4].
During the COVID-19 pandemic, when distance learning became the only possible form of education, many researchers turned to one of the most pressing tasks of learning analytics: developing tools for assessing the quality of education and analyzing feedback from students in the form of answers to Google Forms questions, comments, and reviews on various media platforms [5].

CS&SE@SW 2021: 4th Workshop for Young Scientists in Computer Science & Software Engineering, December 18, 2021, Kryvyi Rih, Ukraine
dyulicheva@gmail.com (Y. Yu. Dyulicheva); lizatkchk@mail.ru (E. A. Bilashova)
https://researchgate.net/Yulia-Dyulicheva (Y. Yu. Dyulicheva)
ORCID: 0000-0003-1314-5367 (Y. Yu. Dyulicheva)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Problems in LA and EDM can be formulated as supervised learning, unsupervised learning, or reinforcement learning problems. They are connected with extracting qualitative regularities from educational data in order to understand the behavior and activities of students and tutors.

The paper has the following purposes:
• to study the effectiveness of machine learning algorithms for solving learning analytics problems;
• to develop a learning analytics tool for extracting regularities from programming language MOOCs using Python libraries.

2. Literature review

This section considers some machine learning algorithms and their use in educational data mining and in the development of learning analytics tools.

2.1. The decision trees and random forest for EDM and LA

Social influences, math achievements, and student engagement (interests and beliefs) affect whether students choose or reject engineering specialities.
It is important to develop tools for monitoring the causes of negative attitudes towards engineering specialities and specialities that require high-quality math knowledge. Tan et al. [6] noted the promise of the linear regression model and random forest for investigating how demographic and family background factors, high school factors, and students’ academic achievements together with their non-cognitive behaviour influence the choice of an engineering major.

Predicting student performance is one of the primary tasks in e-learning systems. It is aimed at identifying the factors that affect student performance, assessing the quality of learning, identifying groups of lagging and successful students, etc. Ahmed and Elaraby [7] treated academic performance prediction as a supervised learning problem with a discrete-valued target function. The authors constructed a decision tree on a feature space that included the faculty; the average score split into intervals (excellent: score >= 85%; very good: >= 75% and < 85%; good: >= 65% and < 75%; bad: >= 50% and < 65%; lowest rating: < 50%); test results; assessment of learning in seminars; student activity as a binary value (yes or no); and homework availability as a binary value (yes or no). From the tree they obtained decision rules for predicting students’ final scores.

A hybrid approach based on a decision tree combined with a genetic algorithm was used to identify groups of successful students together with a description of the professional skills that will be useful for employers [8].

Decision trees can be used to describe the key characteristics of certain groups of students who take MOOCs. Identifying such target groups can help the tutor better understand the target audience of online courses.
Topîrceanu and Grosseck [9] identified the following target groups of students according to their archetypes:
1) egocentric learners (interested in getting the skills they need on a regular basis);
2) short-term learners (focused on short quick courses);
3) comfort-oriented learners (focused not only on short but also on easy-to-learn online courses);
4) interactive learners (focused on short courses and the possibility of active interaction in communities; some of these learners seek to obtain certain competencies, while others use online courses for fun);
5) learners who study in order to distract themselves (the duration of the course is not important for them);
6) competency-seeking learners (focused on the rapid acquisition of professional skills, but unable to complete the course);
7) curious initiators (learners who attend courses out of curiosity and complete them with a probability of approximately 0.5);
8) limited learners (focused on the shortcomings of online courses and frustrated by the lack of a personalized approach);
9) optimistic learners (learners who are always set on success);
10) learners with the Pygmalion effect, who justify their personal failures on the course by the fact that most learners did not complete online courses.

2.2. Deep learning for EDM and LA

2.2.1. Sentiment analysis in education

The challenges of this area are closely related to identifying the mood of students towards a course, teacher, university, etc., based on the analysis of text content in the form of reviews, comments, and posts on various media platforms, including social networks. Analysis of student communities and of the content of student profiles is also important for identifying aggressive behavior [10].

Sentiment analysis and opinion analysis make it possible to identify the emotional component (boredom, aggression, fear, antipathy, frustration, inspiration, joy, lively interest, etc.) associated with learning, to understand the reasons for student behavior (loss of interest in the learning process, involvement in the educational process, etc.), and to detect students’ interests, needs, expectations, and perception of reality. Another application of sentiment analysis in education is the development of systems for the qualitative assessment of the work of teachers and universities, teaching methods, and the quality of the educational process based on the analysis of feedback from students.

Sutoyo et al. [11] noted the similarities between customer retention and learner retention in e-learning environments in terms of understanding behavior and developing strategies. Strategies for retaining learners are closely related to improving learning content and teaching styles, personalizing the curriculum, developing individual learning trajectories, and developing tools for monitoring and evaluating the activity of teachers and students. To construct individual learning trajectories, it is important to understand students’ personality characteristics, which can be extracted by mining students’ opinions from social media data [12].

The main stages of studying text content in educational analytics include text understanding based on natural language processing (NLP) methods, subsequent vectorization (BoW, one-hot encoding, CoVe embedding, word2vec, etc.), and the use of machine learning methods (SVM, neural networks, Random Forest, etc.) for classifying or clustering educational text data. Dsouza et al. [13] investigated the effectiveness of various machine learning methods for the sentiment analysis of student feedback and noted that the Multinomial Naive Bayes classifier was more accurate than SVM and Random Forest. Of particular interest is the combination of text mining with deep learning for extracting regularities from students’ feedback.
Sutoyo et al. [11] proposed using convolutional neural networks with three types of layers to classify students’ comments by sentiment. For feature extraction from texts, the authors used a convolution layer with different kernel sizes, a pooling layer, and a fully connected layer with a softmax activation function, whose output is interpreted as the sentiment class. Kandhro et al. [14] used sentiment analysis for teacher performance assessment based on an LSTM architecture with an embedding layer (word2vec and a pre-trained model based on it), an LSTM layer, a dense layer, and a softmax activation function for the final classification, and achieved about 90% accuracy. The results of learning analytics can be used to develop intelligent systems for monitoring the activity, involvement, and changes in the emotional states of students, for example, math anxiety [15]. Barron-Estrada et al. [16] investigated a CNN combined with an LSTM and achieved 88.26% accuracy. In addition, the authors noted the effectiveness of CNNs for detecting secondary emotions.

2.2.2. Academic performance and students’ dropout prediction

Academic performance research is crucial for universities, MOOCs, and students themselves because academic performance determines the prestige of universities or MOOCs, shapes their reputation, and influences the future careers of students. In addition, developing learning analytics tools for detecting students at risk of dropout is very important for teachers, tutors, instructors, and creators of courses and curricula.

Mondal and Mukherjee [17] proposed using a recurrent neural network with a single hidden layer of 40 neurons and the ReLU activation function to analyze data on 480 students with 17 features in order to predict student performance, and achieved an accuracy of around 85%. The same xAPI-Edu-Data dataset from Kaggle was used by Bendangnuksung and Prabu [18].
The authors considered a deep neural network (DNN) for student performance prediction based on data on 500 students described by 16 features (gender, nationality, grades, topic, and different types of student and parent activities). They used a DNN with two hidden layers, with ReLU and softmax activation functions for the layers respectively, and achieved up to 84.3% accuracy.

Student grade prediction was investigated by Yousafzai et al. [19] using a hybrid deep neural network: a bidirectional long short-term memory (BiLSTM) network with the following layers: an embedding layer, a dropout layer, a bidirectional LSTM layer, an attention layer with tanh and softmax activation functions, and an output layer with a sigmoid activation function, together with a special procedure for selecting the significant features that most influence students’ grades. This BiLSTM architecture with significant feature extraction achieved an accuracy of 90.16%.

Research on students’ click sequences and video watching sequences is usually based on RNN architectures. Jeon and Park [20] investigated the influence of clicks during learning video watching on student dropout. To describe student activity, the authors proposed analyzing n-grams of clicks and video embeddings with a gated recurrent unit (GRU) and achieved 78.3% accuracy in dropout prediction. A deep neural network with 5 hidden layers, a ReLU activation function after each fully connected hidden layer, and a softmax activation function for the output layer achieved 90.2% accuracy in [21]. Xiong et al. [22] used an RNN-LSTM to predict the next behavior of learners from their previous behavior, namely the sequences of previously observed actions in the form of responses, comments, and posts, and achieved about 90% accuracy.

2.2.3. Recognition of students’ facial expressions

The development of approaches to assessing the emotional state of students and teachers during classes can help improve the educational process (which is especially important for distance learning), the quality of interaction between students and teachers, and the atmosphere within the student team, and can help identify content that causes difficulties. D’Errico et al. [23] note the importance of recognizing the cognitive emotions of students for forming a positive perception of the educational process and for the creativity and success of students. Learning analytics tools based on the recognition of students’ cognitive emotions contribute to the creation of systems for tracking student engagement and for assessing the complexity of tasks and their impact on academic performance. Lee and Lee [24] considered a deep neural network with three convolutional layers, max pooling between them, and a final fully connected layer with a softmax activation function for classifying students’ expressions by difficulty level (easy, neutral, hard) during exams. Sharma and Mansotra [25] proposed using a convolutional neural network to recognize students’ facial emotions (sad, happy, neutral, angry, disgust, surprise, and fear) in order to detect moods and the psychological atmosphere.

Many researchers note the importance of studying the emotional state not only of students but also of teachers, since the perception of a teacher as a person who confidently demonstrates their knowledge has a significant impact on the perception of the discipline and can create a stable positive or negative attitude towards the discipline throughout life. In particular, a deep neural network with a convolutional layer, a pooling layer with dense blocks, and an RELM classifier that assigns instructors’ facial expressions to one of 5 classes (awe, amusement, confidence, disappointment, and neutral) was described in [26].

3. Dataset and methodology

We investigated data on 300 MOOCs devoted to the most popular programming languages for machine learning, Python and R. We scraped 6000 reviews of these courses and extracted 1150 sentences that describe the teacher, tutor, or instructor and 2365 sentences that describe the course or tutorial. The main steps of the research are as follows.

1. Clustering of MOOC titles. We used a bag of words for vectorization and the simple cosine measure (the dot product of two vectors in the numerator and the product of their Euclidean norms in the denominator) to detect the similarity of titles, and additionally applied the kMeans clustering method.
2. Sentiment analysis of the 1150 sentences that contain the word “teacher” or its synonyms and of the 2365 sentences about the course, tutorial, or its synonyms. We used the BERT model for the sentiment analysis of each group. BERT (Bidirectional Encoder Representations from Transformers) is based on transfer learning and has a complex architecture with 12 or 24 stacked encoder layers, 12 or 16 bidirectional self-attention heads, and 768 or 1024 hidden units [27]. We used the PyTorch Python library and the encode method for text vectorization based on the pre-trained BERT model. We transformed the resulting tensor of five scores into polarities on a scale from 1 to 5: scores 1 and 2 correspond to negative polarity, 3 to neutral polarity, and 4 and 5 to positive polarity.
3. Frequency analysis of reviews based on polarities. We extracted the most frequent words for each group of sentences with different polarities.

4. Results

Following the main steps of our methodology, we obtained the following results.

4.1. The titles clustering of MOOCs

The results of clustering 300 MOOCs from Udemy based on kMeans and the cosine similarity measure are presented in figure 1.

Figure 1: The results of 14 clusters detection based on the titles of MOOCs.
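The title clustering step (bag-of-words vectors, cosine similarity cos(a, b) = (a · b) / (‖a‖ ‖b‖), and kMeans) can be illustrated with a minimal pure-Python sketch. This is not the code used in the study: the four titles are invented, the function names are hypothetical, and seeding the centroids with the first k titles is an assumption made only to keep the example deterministic. A real pipeline would more likely rely on a library implementation.

```python
import math
from collections import Counter


def bag_of_words(title):
    """Lowercased token counts for one title (a sparse BoW vector)."""
    return Counter(title.lower().split())


def cosine(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(members):
    """Mean BoW vector of the titles assigned to one cluster."""
    total = Counter()
    for vec in members:
        total.update(vec)
    return {w: c / len(members) for w, c in total.items()}


def kmeans_cosine(vectors, k, iterations=10):
    """Assign each vector to the most cosine-similar centroid, then update."""
    # Assumption: the first k vectors seed the centroids (deterministic toy choice).
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = centroid(members)
    return labels


# Invented course titles, for illustration only.
titles = [
    "python for machine learning",
    "r programming for data science",
    "machine learning with python",
    "data science with r",
]
vectors = [bag_of_words(t) for t in titles]
labels = kmeans_cosine(vectors, k=2)
print(labels)
```

With these toy titles, the two Python machine learning courses end up in one cluster and the two R data science courses in the other. The most frequent words of each cluster can then be read off the cluster centroid.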
The top words from the 14 clusters help listeners understand which MOOCs are available and what the main directions of learning are. These words agree well with the key skills and basics learned during an online course and can be used to match popular job vacancies in the labor market. The most widely represented courses on Udemy are beginner-level and advanced-level courses on machine learning, in particular deep learning, web scraping, forecasting, computer vision, and natural language processing.

4.2. Sentiment analysis of reviews of MOOCs based on BERT

We investigated three aspects of MOOCs: the attitude to the teacher, the attitude to the course, and the description of issues during learning. We used the transformers Python library and pre-trained text preprocessing for text vectorization and sentiment detection. The results of sentiment analysis with the BERT deep model are presented in figure 2.

Figure 2: The Sentiments Detection about Relationship to Teacher (right) and Course (left) based on BERT.

The distribution of opinions by groups, taking into account the polarity, shows that students of programming MOOCs are more demanding of the course than of the teacher, but in general they show a positive attitude.

4.3. Frequency analysis of reviews based on polarities and aspects

We created word clouds based on frequency analysis with the wordcloud Python library, taking into account the sentiments and the aspects under study. The results of the frequency analysis are shown in table 1.

Frequency analysis can be used as an additional tool for understanding the causes of students’ issues. For example, for programming MOOCs we extracted such troubles as the installation of libraries, settings, the development of apps with a GUI, and the difficulty of learning two programming languages at the same time. We have highlighted words that describe well the positive and negative emotions of MOOC listeners.
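Steps 2 and 3 of the methodology (turning the five model scores into a polarity and then counting frequent words per polarity) can be sketched as follows. The sketch does not load BERT: the sentences and score lists are invented stand-ins for model output, taking the highest of the five scores as the rating is an assumption about how the tensor is read, and the stop-word list and function names are hypothetical.

```python
from collections import Counter, defaultdict


def stars_from_scores(scores):
    """Return the 1-5 rating whose score is highest (assumed argmax reading)."""
    return max(range(len(scores)), key=lambda i: scores[i]) + 1


def polarity(stars):
    """Map ratings 1-2 to negative, 3 to neutral, 4-5 to positive."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


# Assumption: a tiny stop-word list, sufficient only for this illustration.
STOP_WORDS = {"the", "is", "was", "were", "and", "a", "very", "to", "of"}


def top_words(sentences, n=3):
    """Most frequent non-stop-words across a group of sentences."""
    counts = Counter(
        w for s in sentences for w in s.lower().split() if w not in STOP_WORDS
    )
    return [w for w, _ in counts.most_common(n)]


# Hypothetical (sentence, five scores) pairs; the scores are made up.
scored = [
    ("the instructor is clear and patient", [0.1, 0.2, 0.5, 2.0, 3.1]),
    ("the instructor was clear", [0.0, 0.1, 0.4, 2.5, 1.9]),
    ("the lectures were boring", [2.9, 1.5, 0.3, 0.1, 0.0]),
]

# Group sentences by polarity, then extract the top words of each group.
groups = defaultdict(list)
for sentence, scores in scored:
    groups[polarity(stars_from_scores(scores))].append(sentence)

top = {p: top_words(s) for p, s in groups.items()}
print(top)
```

The per-polarity word lists produced this way are the kind of input from which word clouds for each aspect can be drawn.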
The most frequent words that MOOC listeners used to describe positive emotions towards instructors were knowledgeable, clear, talented, great, patient, etc., while negative emotions were described with the words boring, bad, hard, unprofessional, etc. Learners expressed positive emotions towards the course in the words great, good, complete, amazing, friendly, etc., and described negative emotions with such words as difficult, bad, confused, expected deep, uninformative, lacked, etc.

Table 1
The cloud of words with polarities and aspects
Columns: Aspects, Positive polarity, Negative polarity. Rows: teacher, course, issue. [Word cloud images not available.]

We demonstrate that even a simple tool built on Python natural language processing libraries gives an understanding of the advantages and disadvantages of MOOCs. Developers, experts, and instructors can use this understanding to create better-quality MOOCs oriented to the preferences of each listener, based on the analysis of listener feedback.

5. Conclusion

We demonstrated that text mining approaches based on the deep BERT model and clustering can be considered instruments for learning analytics. Such instruments are aimed at creating personalized MOOCs and their content and at understanding the issues, preferences, and needs of students. Feedback from students in the form of comments, reviews, and posts can be used to assess the quality of education and to identify directions for its improvement. We demonstrated that MOOC learners are more demanding of the course and its content than of the teacher, and that their opinions, in general, have a positive sentiment.

References

[1] M. S. Mazorchuk, T. S. Vakulenko, A. O. Bychko, O. H. Kuzminska, O. V. Prokhorov, Cloud technologies and learning analytics: Web application for PISA results analysis and visualization, CEUR Workshop Proceedings 2879 (2020) 484–494.
[2] P. Bachhal, S. Ahuja, S. Gargrish, Educational data mining: A review, Journal of Physics: Conference Series 1950 (2021) 012022. doi:10.1088/1742-6596/1950/1/012022.
[3] E. İnan, M. Ebner, Learning Analytics and MOOCs, in: P. Zaphiris, A. Ioannou (Eds.), Learning and Collaboration Technologies. Designing, Developing and Deploying Learning Experiences, Springer International Publishing, Cham, 2020, pp. 241–254. doi:10.1007/978-3-030-50513-4_18.
[4] S. Nunn, J. T. Avalla, T. Kanai, M. Kebritchi, Learning analytics methods, benefits, and challenges in higher education: A systemic literature review, Online Learning 20 (2016). doi:10.24059/olj.v20i2.790.
[5] M. Umair, A. Hakim, A. Hussain, S. Naseem, Sentiment analysis of students’ feedback before and after COVID-19 pandemic, International Journal on Emerging Technologies 12 (2021) 177–182. URL: https://www.researchgate.net/publication/353305417_Sentiment_Analysis_of_Students%27_Feedback_before_and_after_COVID-19_Pandemic.
[6] L. Tan, J. B. Main, R. Darolia, Using random forest analysis to identify student demographic and high school-level factors that predict college engineering major choice, Journal of Engineering Education 110 (2021) 572–593. doi:10.1002/jee.20393.
[7] A. B. E. D. Ahmed, I. S. Elaraby, Data mining: A prediction for student’s performance, World Journal of Computer Application and Technology 2 (2014) 43–47. doi:10.13189/wjcat.2014.020203.
[8] H. Hamsa, S. Indiradevi, J. J. Kizhakkethottam, Student academic performance prediction model using decision tree and fuzzy genetic algorithm, Procedia Technology 25 (2016) 326–332. doi:10.1016/j.protcy.2016.08.114.
[9] A. Topîrceanu, G. Grosseck, Decision tree learning used for the classification of student archetypes in online courses, Procedia Computer Science 112 (2017) 51–60. doi:10.1016/j.procs.2017.08.021.
[10] F. K. Ventirozos, I. Varlamis, G. Tsatsaronis, Detecting aggressive behavior in discussion threads using text mining, in: A. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing, Springer International Publishing, Cham, 2018, pp. 420–431. doi:10.1007/978-3-319-77116-8_31.
[11] E. Sutoyo, A. Almaarif, I. T. R. Yanto, Sentiment analysis of student evaluations of teaching using deep learning approach, in: J. H. Abawajy, K.-K. R. Choo, H. Chiroma (Eds.), International Conference on Emerging Applications and Technologies for Industry 4.0 (EATI’2020), Springer International Publishing, Cham, 2021, pp. 272–281. doi:10.1007/978-3-030-80216-5_20.
[12] A. Khowaja, M. H. Mahar, H. Nawaz, S. Wasi, S. ur Rehman, Personality evaluation of student community using sentiment analysis, International Journal of Computer Science and Network Security 19 (2019) 167–180.
[13] D. D. Dsouza, Deepika, D. P. Nayak, E. J. Machado, N. D. Adesh, Sentimental analysis of students feedback using machine learning techniques, International Journal of Recent Technology and Engineering 8 (2019) 986–991. URL: https://www.ijrte.org/wp-content/uploads/papers/v8i1s4/A11810681S419.pdf.
[14] I. A. Kandhro, S. Wasi, K. Kumar, M. Rind, M. Ameen, Sentiment analysis of students’ comment using long-short term model, Indian Journal of Science and Technology 12 (2019). doi:10.17485/ijst/2019/v12i8/141741.
[15] Y. Dyulicheva, Learning Analytics in MOOCs as an Instrument for Measuring Math Anxiety, Voprosy obrazovaniya / Educational Study Moscow (2021). doi:10.17323/1814-9545-2021-4-243-265.
[16] M. L. Barron-Estrada, R. Zatarain-Cabada, R. Oramas-Bustillos, Emotion recognition for education using sentiment analysis, Research in Computing Science 148 (2019) 71–80. doi:10.13053/rcs-148-5-8.
[17] A. Mondal, J. Mukherjee, An approach to predict a student’s academic performance using Recurrent Neural Network (RNN), International Journal of Computer Applications 181 (2019) 1–5. doi:10.5120/ijca2018917352.
[18] Bendangnuksung, P. Prabu, Students’ performance prediction using deep neural network, International Journal of Applied Engineering Research 13 (2018) 1171–1176.
[19] B. K. Yousafzai, S. A. Khan, T. Rahman, I. Khan, I. Ullah, A. Ur Rehman, M. Baz, H. Hamam, O. Cheikhrouhou, Student-performulator: Student academic performance using hybrid deep neural network, Sustainability 13 (2021). URL: https://www.mdpi.com/2071-1050/13/17/9775. doi:10.3390/su13179775.
[20] B. Jeon, N. Park, Dropout prediction over weeks in MOOCs by learning representations of clicks and videos (2020). arXiv:2002.01955.
[21] J. Whitehill, K. Mohan, D. Seaton, Y. Rosen, D. Tingley, Delving deeper into MOOC student dropout prediction, 2017. arXiv:1702.06404.
[22] F. Xiong, K. Zou, Z. Liu, H. Wang, Predicting learning status in MOOCs using LSTM, in: Proceedings of the ACM Turing Celebration Conference - China, ACM TURC ’19, Association for Computing Machinery, New York, NY, USA, 2019, p. 74. doi:10.1145/3321408.3322855.
[23] F. D’Errico, M. Paciello, B. D. Carolis, A. Vattanid, G. Palestra, G. Anzivino, Cognitive emotions in e-learning processes and their potential relationship with students’ academic adjustment, International Journal of Emotional Education 10 (2018) 89–111. URL: https://files.eric.ed.gov/fulltext/EJ1177644.pdf.
[24] H.-J. Lee, D. Lee, Study of process-focused assessment using an algorithm for facial expression recognition based on a deep neural network model, Electronics 10 (2021) 54. URL: https://www.mdpi.com/2079-9292/10/1/54. doi:10.3390/electronics10010054.
[25] A. Sharma, V. Mansotra, Deep learning based student emotion recognition from facial expressions in classrooms, International Journal of Engineering and Advanced Technology 8 (2019) 4691–4699. doi:10.35940/ijeat.F9170.088619.
[26] Y. K. Bhatti, A. Jamil, N. Nida, M. H. Yousaf, S. Viriri, S. A. Velastin, Facial expression recognition of instructor using deep features and extreme learning machine, Computational Intelligence and Neuroscience 2021 (2021) 5570870. doi:10.1155/2021/5570870.
[27] S. A. Rauf, Y. Qiang, S. B. Ali, W. Ahmad, Using BERT for checking the polarity of movie reviews, International Journal of Computer Applications 177 (2019) 37–41. doi:10.5120/ijca2019919675.