=Paper=
{{Paper
|id=Vol-3024/paper6
|storemode=property
|title=Identification of class-representative learner personas
|pdfUrl=https://ceur-ws.org/Vol-3024/paper6.pdf
|volume=Vol-3024
|authors=Célina Treuillier,Anne Boyer
}}
==Identification of class-representative learner personas==
Célina Treuillier 1,2 and Anne Boyer 1,2
1 Lorraine University, 34 Cours Léopold, 54000 Nancy, France
2 LORIA, 615 Rue du Jardin Botanique, 54506 Vandœuvre-lès-Nancy, France

Abstract

Students' interactions with Virtual Learning Environments produce a large amount of data, known as learning traces, which is commonly used in the Learning Analytics (LA) domain to enhance the learning experience. We propose to define personas that are representative of subsets of students sharing common digital behaviors. Embodying the output of LA systems in the form of personas makes it possible to study the representativeness of the dataset with precision and to act accordingly, but also to enhance explicability for the pedagogical experts who must manipulate these tools. These personas are defined from learning traces, which are processed to identify homogeneous subsets of learners. The presented methodology also identifies outliers, who exhibit atypical behaviors, and thus makes it possible to represent all students without privileging some of them.

Keywords

Learning Analytics – Learning Systems – Learner Personas – Virtual Learning Environments – Explicability – Corpus representativeness

1. Introduction

The generalization of digital environments in education leads to the collection of large amounts of educational data, which can be personal information on learners, academic performances of students, or interaction traces. This data can be processed by Learning Analytics (LA) tools. LA was defined in 2011 as "the measurement, collection, analysis, and reporting of data about learners and their contexts, for purposes of understanding and optimizing learning and the environments in which it occurs" (1). It allows understanding the digital behaviors of students, modeling, explaining, or predicting them, and thus better understanding how a smart learning environment (SLE) is used. The collection and exploitation of educational data raise ethical questions such as privacy, security, informed consent, or bias (2). Specific frameworks have been proposed, such as the DELICATE checklist (3), which provides a guide for assessing the proper use of educational data. More recently, some researchers (4) have called for a more complete and accurate evaluation of digital learning environments, going beyond the common evaluation that mainly deals with global algorithmic performance. Even if the computation of various measures (precision, recall, RMSE, MAE...) gives clues about the quality of the system (5), pedagogical aspects are missing.

This paper contributes to the design of a methodology dealing with the critical issue of automatically identifying digital learning behaviors from educational data. Knowing these digital learning behaviors leads to a more precise evaluation (performances can be given for each specific digital learning behavior). It may also inform pedagogical experts about the way learners behave within a specific SLE, and therefore contributes to explicability. In this context, we propose to characterize learners' online behaviors using learning indicators reflecting the behaviors (interaction, activity, learning) of a specific learner. They are computed from a subset of features available in learning traces and carry significant pedagogical information (6,7).
Measuring such indicators makes it possible to differentiate students based on their behaviors, and thus to provide them with more personalized support (8). Indeed, within a single class, not all students have the same needs, and a given piece of advice is not appropriate for all learners, especially in large groups in which students have varied backgrounds, objectives, and skills (9). This lack of homogeneity among students is exacerbated in an online learning environment, which increases inequality (9). Several studies have already attempted to categorize students based on learning traces for different purposes: to identify students who can benefit from the same intervention by the instructor (10), to detect students who are going to drop out or students at risk (11,12), to evaluate performance (13), or to provide adapted recommendations (14). Here, we are more interested in defining online behaviors in order to characterize the dataset in a new way. That is why we propose to describe the dataset in the form of "personas", corresponding to subsets of students sharing common behaviors. The description of the dataset in the form of personas will first allow us to analyze the representativeness of the corpus: the learning performances will be detailed according to the various subsets of students, and it will be possible to evaluate whether some are under-represented, over-represented, or not represented at all. These personas will also improve explicability by embodying the outputs of the system in the form of fictitious students to whom pedagogical experts can refer. The challenge is therefore to define learner personas from the learning traces. The research question is then (RQ): How to define learner personas based on learning traces and indicators?

To carry out this study, we work with the broadly used Open University Learning Analytics Dataset (OULAD), which is described in the following section. We present our methodology in the third section. The results are described in the fourth section. Finally, we conclude and give some perspectives.

2. Dataset and learning indicators

The OULA Dataset (15) gathers data about 32,593 students involved in distance learning. It is fully anonymized and contains demographic data, interaction data, and the results of the various evaluations. The interaction data mainly describe the activity on available materials, i.e., the clicks made on specific resources, and are time-stamped. Students may have 4 types of outcomes: pass, fail, withdrawn, or distinction. We select the February 2013 presentation of the STEM module D (duration of 240 days, 14 assessments, 1303 students).

As previously explained, the division of students into subsets sharing common digital behaviors is based on learning indicators that we characterize from existing studies. In total, 5 indicators are used: engagement (16), performance (17), regularity (18), responsiveness (18), and curiosity (19). Table 1 summarizes the description of the indicators.

Table 1. Learning indicators.
Indicator | Definition | Features
Performance (17) | Student's outcomes | Scores in the 14 assessments, ranging from 0 to 100.
Reactivity (18) | Responsiveness to course-related events | Delay between the date the assignment is returned and the deadline (in days).
Engagement (16) | Student activity | Number of clicks on selected types of activities + total number of clicks, all activities combined.
Regularity (18) | Behavioral patterns of actions | Number of active days on selected types of activities + total active days + mean number of clicks per day on the same types of activities and globally.
Curiosity (19) | Intrinsic motivation | Number of different types of activity consulted + number of different resources consulted.

When students did not turn in an assignment, did not get a grade for an assessment, or did not make any click, the initial dataset includes null values. We replace missing values with 0 when no clicks were made or no results were recorded. Likewise, when an assignment was not returned, we replace the missing value with 240, corresponding to the duration of the course. As resources are available a few weeks before the course starts, some students have a number of active days greater than the duration of the module, up to 260 days. Our initial dataset D thus includes a total of 45 features corresponding to the description of the 5 learning indicators, for 1303 students. We divide D according to the 4 types of results and obtain 4 independent datasets whose sizes are summarized in Table 2. Each dataset is analyzed and thus undergoes various processing steps, which are detailed in the following section.

Table 2. Dimensions of the four datasets.
Dataset | Number of students | Proportion
Pass | 456 | 35.0%
Fail | 361 | 27.7%
Withdrawn | 432 | 33.2%
Distinction | 54 | 4.1%
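For illustration, the feature assembly and missing-value handling described above can be sketched with pandas as follows. File and column names follow the public OULAD CSV distribution; the module and presentation codes, the reduced feature set, and all variable names are assumptions made for this sketch, not the paper's exact 45-feature implementation.

```python
# Hypothetical pre-processing sketch for OULAD (column names follow the public
# OULAD CSV files; the paper's real feature set is richer than shown here).
import pandas as pd

MODULE, PRESENTATION, COURSE_LENGTH = "DDD", "2013B", 240  # assumed course codes

info = pd.read_csv("studentInfo.csv")
vle = pd.read_csv("studentVle.csv")
assessments = pd.read_csv("assessments.csv")
submissions = pd.read_csv("studentAssessment.csv")

# Keep a single course presentation only.
info = info[(info["code_module"] == MODULE) & (info["code_presentation"] == PRESENTATION)]
vle = vle[(vle["code_module"] == MODULE) & (vle["code_presentation"] == PRESENTATION)]
assessments = assessments[(assessments["code_module"] == MODULE)
                          & (assessments["code_presentation"] == PRESENTATION)]
# Attach each submission to its assessment deadline (inner join also filters).
submissions = submissions.merge(assessments[["id_assessment", "date"]], on="id_assessment")

# Performance: one score per assessment.
scores = submissions.pivot_table(index="id_student", columns="id_assessment", values="score")

# Reactivity: delay between submission date and deadline (in days).
submissions["delay"] = submissions["date_submitted"] - submissions["date"]
delays = submissions.pivot_table(index="id_student", columns="id_assessment", values="delay")

# Engagement / regularity / curiosity: clicks, active days, distinct resources.
activity = vle.groupby("id_student").agg(
    total_clicks=("sum_click", "sum"),
    active_days=("date", "nunique"),
    distinct_resources=("id_site", "nunique"),
)

# Align on the full student list and apply the missing-value rules of the paper:
# 0 for missing scores and clicks, the course duration for unreturned assignments.
students = info["id_student"].unique()
scores = scores.reindex(students).fillna(0)
delays = delays.reindex(students).fillna(COURSE_LENGTH)
activity = activity.reindex(students).fillna(0)
features = scores.join(delays, lsuffix="_score", rsuffix="_delay").join(activity)

# Split into the four outcome datasets (Pass, Fail, Withdrawn, Distinction).
outcomes = info.set_index("id_student")["final_result"]
datasets = {label: features.loc[grp.index] for label, grp in outcomes.groupby(outcomes)}
```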
3. Methodology

To meet our challenges (evaluation of representativeness and enhancement of explicability), we propose to extract, from a heterogeneous set, homogeneous subsets of students adopting similar behaviors. Each student is characterized by a profile, consisting of a sequence of learning traces. Some students present atypical behaviors and cannot be associated with any sufficiently large subset. They are therefore considered as 'outliers' and are treated separately.

The initial dataset D is composed of several learners described by their profiles P1, P2, ..., Pn. Each profile Pi, associated with a single student, is composed of a sequence of traces Ti,j (the j-th trace of the student associated with profile Pi). The goal is to find homogeneous subsets Sk, i.e., subsets of profiles Pi composed of sequences of traces reflecting similar behaviors. Profiles that are too dissimilar are considered as outliers Op. If the number of profiles Pi in a subset Sk is lower than a given threshold, we consider the associated profiles as outliers.

We use these subsets to define "personas", which have been defined by Brooks and Greer (20) as "narrative descriptions of typical learners that can be identified through centroids of machine learning classification processes". In our case, learner personas are based on student interaction data with the learning environment and are defined from the outcomes of the clustering method. It is important to note that our definition of personas differs from the one commonly used in UX design (21). Indeed, here, personas are used after the design phase of the tool to ensure that the latter can respond to all students with the same quality. Thus, the personas we define allow us to describe a digital learning behavior shared by several students likely to benefit from the same advice, to study the representativeness of the corpus, and to enhance explicability.

The applied methodology is broken down into different parts: first, the data undergoes a pre-processing phase during which we handle the null values (NAs) and standardize the data. Data standardization is a common process in Machine Learning that rescales numerical variables to make them comparable on a common scale. After this pre-processing phase, we detect outliers: this allows splitting the initial dataset into an inliers dataset and an outliers dataset. Due to their atypical behavior, the outliers are examined independently, and the inliers are divided into subsets using an unsupervised clustering algorithm. Finally, the characteristics of each homogeneous group, i.e., the behaviors adopted, allow the definition of personas, which are descriptions of typical students to whom the system must be able to respond, always with the same quality.
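The first two steps of this pipeline (standardization and separation of inliers and outliers) can be sketched with scikit-learn as follows, using the components and contamination value reported in Section 4.1; the helper name and defaults are ours, not the authors' code.

```python
# Minimal sketch of the pre-processing and outlier-detection steps
# (scikit-learn components named in Section 4.1; not the authors' exact code).
import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import IsolationForest

def split_inliers_outliers(features: pd.DataFrame, contamination: float = 0.01):
    """Standardize the indicator features, then separate atypical profiles."""
    scaler = RobustScaler()                      # robust to the outliers we expect
    X = scaler.fit_transform(features)

    iso = IsolationForest(contamination=contamination, random_state=0)
    flags = iso.fit_predict(X)                   # +1 = inlier, -1 = outlier

    inliers = features[flags == 1]
    outliers = features[flags == -1]
    return inliers, outliers, scaler

# e.g. inliers, outliers, scaler = split_inliers_outliers(datasets["Pass"])
```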
4. Experimentation

4.1. Description

The whole implementation was performed using the scikit-learn library for Python. For the standardization phase, after studying and comparing the different existing scalers, we selected the RobustScaler provided by scikit-learn, which is particularly adapted to datasets including outliers. We then applied the IsolationForest algorithm to isolate atypical data, with contamination set to 0.01. Finally, we applied the K-means algorithm, which is well suited to LA datasets (22), for the clustering phase. The centers of the resulting clusters allow us to define the personas and analyze them. The quality of the partition is evaluated using the Davies-Bouldin criterion (23) and Silhouette analysis (24). All the steps were applied independently to our four datasets (Pass, Fail, Withdrawn, Distinction).

4.2. Results

First, the IsolationForest algorithm identifies inliers and outliers, which are then separated into independent datasets. Inliers were then processed with the K-means algorithm for different values of K (2 to 10, 12, 15), and performance measures were computed to choose the optimal number of clusters (Table 4). The number of outliers and inliers for each dataset and the performances for the optimal value of K are given in Table 3.

Table 3. Number of inliers and outliers, the optimal number of clusters, and performances.
Dataset | Inliers | Outliers | Optimal value of K | Davies-Bouldin Index | Silhouette Index
Pass | 451 | 5 | 10 | 0.70 | 0.78
Fail | 357 | 4 | 8 | 0.16 | 0.91
Withdrawn | 427 | 5 | 4 | 0.82 | 0.83
Distinction | 53 | 1 | 6 | 0.05 | 0.88

For each dataset, cluster sizes, i.e., the number of students sharing similar behaviors within the same subset, differ greatly. Overall, in each dataset, there is one larger subset representing the major proportion of learners, and some smaller subsets, sometimes representing only one student. The larger subset corresponds to the prime persona: it is representative of the majority of students in the studied dataset. Smaller clusters whose size nonetheless exceeds the threshold are defined as under-represented personas. Please note that these personas, even if they represent fewer learners, need to be evaluated and treated with the same quality as prime personas. Finally, as explained, the students composing clusters of size smaller than the threshold of 10 are considered as outliers.
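The model-selection sweep and the size-based labelling described above might look as follows, assuming X is a standardized inlier matrix (e.g. scaler.transform(inliers) from the previous sketch). The candidate K values and the threshold of 10 come from the text; how the two quality indices are combined into a single choice of K is not specified in the paper, so the sketch simply reports both.

```python
# Hypothetical sketch: evaluate candidate K values, then label clusters as prime
# persona, under-represented personas, or outliers (clusters smaller than 10).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

CANDIDATE_K = list(range(2, 11)) + [12, 15]
SIZE_THRESHOLD = 10

def choose_k(X):
    """Fit K-means for each candidate K and report both partition-quality indices."""
    results = {}
    for k in CANDIDATE_K:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        results[k] = (davies_bouldin_score(X, labels),   # lower is better
                      silhouette_score(X, labels))       # higher is better
    return results

def label_personas(labels):
    """Split clusters into prime persona, under-represented personas, and outliers."""
    clusters, sizes = np.unique(labels, return_counts=True)
    prime = clusters[np.argmax(sizes)]
    under_represented = [c for c, s in zip(clusters, sizes)
                         if s >= SIZE_THRESHOLD and c != prime]
    outlier_clusters = [c for c, s in zip(clusters, sizes) if s < SIZE_THRESHOLD]
    return prime, under_represented, outlier_clusters
```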
These outlier students exhibit unique behaviors and need to be treated separately, since they require adapted support, like the outliers identified with the IsolationForest algorithm. In this paper, due to lack of space, we cannot describe all the personas; we detail the most interesting and representative ones and give relevant values corresponding to the cluster centers of the described personas.

Firstly, for successful students, the prime persona (Figure 1 - A) represents 69% of the dataset (312 learners). These students are very active (2240 clicks), especially on the forums (522 clicks). They are also regular, since they are active for more than 130 days over the total duration of the module. The resources consulted are numerous (167). This active, regular, and curious behavior allows them to obtain good results throughout the module. If we now consider the students who failed, some of the under-represented students (62 students, 17.67%) (Figure 1 - B) were more active (1871 clicks), more regular (110 active days), and more curious (145 resources consulted) than the majority of students in the same dataset. They turned in all the assignments on time but obtained low scores and therefore performed poorly. The work provided does not seem to allow this subset of students to succeed.

Figure 1: A: Prime persona (Pass dataset), B: Under-represented persona (Fail dataset)

Next, one of the outliers of the Withdrawn dataset (Figure 2) shows an exemplary behavior at the beginning of the course, with high activity (4267 clicks), high regularity (178 active days), and curiosity (188 consulted resources), but gives up on the last assignment, which is not handed in.

Figure 2: Outlier persona - Withdrawn dataset

Interestingly, for the Distinction dataset, we do not observe any under-represented personas: students not belonging to the main subset are outliers. The described personas are interesting since they are diversified and allow us to clearly differentiate the students according to their online behaviors. Besides, the personas of each dataset are very representative of the associated final result. Thus, the subsets of students identified by our methodology reflect a variety of digital behaviors, and do not focus on describing only the most common ones. In this way, the representativeness analysis of the corpus can be improved, ensuring that students engaging in under-represented behaviors are identified and treated with the same quality as other students. Finally, the association of each persona with various learning indicators makes it possible to embody the results of LA algorithms in a clear and complete way that can be easily understood by learning experts, and thus contributes to the enhancement of explicability.
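Since each persona is read off a cluster center, raw-scale values such as the click counts and active days quoted above can be recovered by mapping the centroids back through the scaler. A minimal sketch, to be applied to an inlier dataset, with hypothetical names:

```python
# Hypothetical sketch: turn K-means centroids back into readable persona profiles
# by undoing the RobustScaler transform (feature names are those of the first sketch).
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import RobustScaler

def describe_personas(features: pd.DataFrame, k: int) -> pd.DataFrame:
    """Cluster standardized features and return one raw-scale row per persona."""
    scaler = RobustScaler()
    X = scaler.fit_transform(features)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    centers = scaler.inverse_transform(kmeans.cluster_centers_)     # back to raw units
    personas = pd.DataFrame(centers, columns=features.columns)
    personas["size"] = pd.Series(kmeans.labels_).value_counts().sort_index().values
    return personas  # e.g. total_clicks, active_days, distinct_resources per persona
```

The largest row returned by such a helper would correspond to the prime persona of the dataset, and the smaller ones to the under-represented personas or outlier clusters.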
5. Discussion and Perspectives

The presented results show that it is possible to define learner personas from homogeneous subsets of students, based on learning indicators computed from learning traces. The presented methodology thus differs from existing ones, which generally only allow the identification of clusters. On the one hand, personas make it possible to represent a wide variety of behaviors adopted by the student population studied. It is to these different subsets of students that educational systems must be able to respond indiscriminately, even if some groups represent a larger or smaller population of students. Personas representing a very small number of students, or a single student, deserve as much attention as the others and should not be dismissed. That is why we talk about representativeness: all students, regardless of their behavior, must receive help that is adapted to them, always with the same quality, without some being over-, under-, or non-represented. On the other hand, embodying the results of LA algorithms in the form of personas seems to us to be an important step towards improving the explicability of systems; at the same time, we hope it will increase user confidence, reach a wider audience, and have a positive impact on various stakeholders. Overall, this study provides a new approach to evaluate SLEs fairly, based on explainable LA, to increase user confidence while developing more ethical systems.

As a follow-up to this work, we plan to study some specific categories of learners, such as repeating students, and to examine the presence of specific student profiles defined in the literature, such as those detailed in the ICAP model (25). Finally, we can also imagine improving the description of personas by allowing teachers to select the indicators best suited to their subject or pedagogy.

6. Acknowledgments

This work is done in the framework of the LOLA (Laboratoire Ouvert en Learning Analytics) project, with the support of the French Ministry of Higher Education, Research and Innovation.

7. References

1. Siemens G, Long P. Penetrating the Fog: Analytics in Learning and Education. EDUCAUSE Review. 2011;46(5):30.
2. Slade S, Prinsloo P. Learning Analytics: Ethical Issues and Dilemmas. American Behavioral Scientist. 2013;57(10):1510–29.
3. Drachsler H, Greller W. Privacy and analytics: it's a DELICATE issue. A checklist for trusted learning analytics. In: Proceedings of the Sixth International Conference on Learning Analytics & Knowledge - LAK '16. Edinburgh, United Kingdom: ACM Press; 2016.
4. Holmes W, Porayska-Pomsta K, Holstein K, Sutherland E, Baker T, Shum SB, et al. Ethics of AI in Education: Towards a Community-Wide Framework. Int J Artif Intell Educ. 2021.
5. Erdt M, Fernández A, Rensing C. Evaluating Recommender Systems for Technology Enhanced Learning: A Quantitative Survey. IEEE Transactions on Learning Technologies. 2015;8(4):326–44.
6. Iksal S. Ingénierie de l'observation basée sur la prescription en EIAH. 2012.
7. Ben Soussia A, Roussanaly A, Boyer A. An in-depth methodology to predict at-risk learners. 16th European Conference on Technology Enhanced Learning [Manuscript submitted for publication]. 2021.
8. Mupinga DM, Nora RT, Yaw DC. The Learning Styles, Expectations, and Needs of Online Students. College Teaching. 2006;54(1):185–9.
9. Xu D, Jaggars SS. Performance Gaps between Online and Face-to-Face Courses: Differences across Types of Students and Academic Subject Areas. The Journal of Higher Education. 2014;85(5):633–59.
10. Mojarad S, Essa A, Mojarad S, Baker R. Data-driven learner profiling based on clustering student behaviors: learning consistency, pace and effort. 2018.
11. Haiyang L, Wang Z, Benachour P, Tubman P. A Time Series Classification Method for Behaviour-Based Dropout Prediction. In: 2018 IEEE 18th International Conference on Advanced Learning Technologies (ICALT). 2018. p. 191–5.
12. Tempelaar D, Rienties B, Mittelmeier J, Nguyen Q. Student profiling in a dispositional learning analytics application using formative assessment. Computers in Human Behavior. 2018;78:408–20.
13. Lotsari E, Verykios VS, Panagiotakopoulos C, Kalles D. A Learning Analytics Methodology for Student Profiling. In: Likas A, Blekas K, Kalles D, editors. Artificial Intelligence: Methods and Applications. Cham: Springer International Publishing; 2014.
14. Paiva ROA, Bittencourt II, da Silva AP, Isotani S, Jaques P. Improving pedagogical recommendations by classifying students according to their interactional behavior in a gamified learning environment. In: Proceedings of the 30th Annual ACM Symposium on Applied Computing. Salamanca, Spain: ACM; 2015.
15. Kuzilek J, Hlosta M, Zdrahal Z. Open University Learning Analytics dataset. Sci Data. 2017;4(1):170171.
16. Hussain M, Zhu W, Zhang W, Abidi SMR. Student Engagement Predictions in an e-Learning System and Their Impact on Student Course Assessment Scores. Computational Intelligence and Neuroscience. 2018;2018:e6347186.
17. Arnold KE, Pistilli MD. Course signals at Purdue: using learning analytics to increase student success. In: Proceedings of the 2nd International Conference on Learning Analytics and Knowledge - LAK '12. Vancouver, British Columbia, Canada: ACM Press; 2012.
18. Boroujeni MS, Sharma K, Kidziński Ł, Lucignano L, Dillenbourg P. How to Quantify Student's Regularity? In: Verbert K, Sharples M, Klobučar T, editors. Adaptive and Adaptable Learning. Cham: Springer International Publishing; 2016. p. 277–91. (Lecture Notes in Computer Science).
19. Pluck G, Johnson HL. Stimulating curiosity to enhance learning. 2011.
20. Brooks C, Greer J. Explaining predictive models to learning specialists using personas. In: Proceedings of the Fourth International Conference on Learning Analytics And Knowledge - LAK '14. Indianapolis, Indiana: ACM Press; 2014.
21. Lallemand C, Gronier G. Méthodes de design UX: 30 méthodes fondamentales pour concevoir et évaluer les systèmes interactifs. Paris: Eyrolles; 2016.
22. Navarro ÁAM, Ger PM. Comparison of Clustering Algorithms for Learning Analytics with Educational Datasets. IJIMAI. 2018;5(2):9–16.
23. Davies DL, Bouldin DW. A Cluster Separation Measure. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1979;PAMI-1(2):224–7.
24. Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics. 1987;20:53–65.
25. Chi MTH, Wylie R. The ICAP Framework: Linking Cognitive Engagement to Active Learning Outcomes. Educational Psychologist. 2014;49(4):219–43.