Large Language Models for Learner Assistance in Massive Open Online Courses: Challenges, Tools, and Approaches

Jesus-Angel del-Hoyo-Gabaldon1, Eva Garcia-Lopez1, Antonio Garcia-Cabot1,∗, David de-Fitero-Dominguez1, Mary-Ellen Wiltrout2, Jessica Sandland2 and Ana Bell2
1 Universidad de Alcalá, Ctra. Madrid-Barcelona km 33.6, 28805 Alcalá de Henares, Spain
2 Massachusetts Institute of Technology, Massachusetts Avenue 77, 02139 MA, Cambridge, USA

∗ Corresponding author.
jesus.hoyo@edu.uah.es (J. A. del-Hoyo-Gabaldon); eva.garcial@uah.es (E. Garcia-Lopez); a.garciac@uah.es (A. Garcia-Cabot); david.fitero@uah.es (D. de-Fitero-Dominguez); mew27@mit.edu (M. E. Wiltrout); jgsandla@mit.edu (J. Sandland); anabell@mit.edu (A. Bell)
ORCID: 0000-0002-7598-3289 (E. Garcia-Lopez); 0000-0002-0298-3237 (A. Garcia-Cabot); 0000-0002-4647-4282 (M. E. Wiltrout)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Artificial Intelligence has undergone a significant revolution in recent years. The emergence and subsequent development of the Transformer architecture led to extensive research resulting in large language models (LLMs). These systems power widely used applications, such as ChatGPT, which is based on LLMs fine-tuned with human instructions to enhance their performance. Evidence shows that they outperform previous model families (BERT, GPT, T5), even with less complex configurations. In addition, online courses and Massive Open Online Courses (MOOCs) face a well-known issue: high dropout rates. Scholars aim to tackle this problem by introducing innovative systems and alternatives to enhance students' learning experiences and prevent course abandonment. One option is to complement these courses with LLMs, which can power chatbots or comparable systems to facilitate learning and engagement. The present proposal focuses on developing an automated system wherein a chatbot automatically produces questions related to course content. Learners will then receive feedback on their answers through semantic similarity mechanisms, indicating the specific content they need to review.

Keywords
Massive Open Online Course, MOOC, Transformers, Large Language Models, LLM, Deep Learning, Multiple Choice Question Generation, Visual Question Generation

1. Introduction
One of the most compelling features of Massive Open Online Courses (MOOCs) is their ability to reach an extremely diverse set of learners. MOOC learners come from a wide range of demographic backgrounds, representing a wide variety of ages, educational backgrounds, and employment levels [1]. Furthermore, the diversity of MOOC learners is not limited to demographics. The motivations that learners have for participating in MOOCs can differ greatly. Perdue identifies a variety of motivations that drive MOOC participation, including curiosity and exploration, skill acquisition, and the desire to connect with others [2]. This diversity of motivations leads learners to interact with MOOCs in ways that range from sampling specific course content of interest to engaging with all of the course material and earning a certificate of completion.
This diversity of participants provides a unique opportunity for instructors to engage learners who differ substantially in background and motivation from the students in their university classrooms; however, it also poses a unique challenge. Meeting the needs of a group of learners with vastly different backgrounds, preparation, and motivations is difficult within the context of a single course. Many MOOCs, including those considered in this paper, aim to provide a university-level learning experience for course participants, including challenging formative and summative assessments. However, not all learners enter a course with the background and skills necessary to successfully complete these challenging assessments. Indeed, both Gütl et al. and Onah, Sinclair and Boyatt have identified course difficulty and lack of learner preparation as factors motivating learner dropout from MOOCs [3], [4].
This study proposes leveraging large language models (LLMs) to help bridge this gap between learners' current levels of knowledge and the knowledge required to successfully engage with the course assignments. We also aim to employ LLMs to create supplementary content for learners who do not intend to complete all the assessments in the course, but rather choose to engage with the course content in a less formal way.
The emergence of the Transformer architecture [5] gave rise to transformer-based LLMs, which had a profound impact on the field of Natural Language Processing (NLP). Currently, the most promising approach is to align them with human instructions [6]. Consequently, research has been conducted to investigate the impact of instruction-based LLMs on learning. A recent example demonstrates that a chatbot powered by GPT-4 enhanced the performance of adopters on an exam in an online coding class [7]. However, learner engagement was reduced, suggesting that a different approach to integration may be required. Other examples illustrate further uses in MOOCs, such as a system to validate peer-assigned essay scores [8] or a GPT-4-based system for automatically grading writing assessments [9]. These systems would be significantly enhanced by the incorporation of effective feedback for learners. Recent evidence has demonstrated the feasibility of providing constructive feedback to learners through ChatGPT [10]. Consequently, this study proposes a novel approach for learner assessment and assistance in MOOCs.
The remainder of this paper is structured as follows: Section 2 presents an initial pipeline design for the automatic generation of test questions and their subsequent evaluation to provide valuable feedback to learners. Section 3 concludes the proposal.

2. Pipeline design
The primary objective of the proposed system is to minimize the impact of the aforementioned issues, with particular emphasis on engagement. As previously stated, LLM-powered solutions have the potential to enhance learning outcomes in MOOCs. However, it is crucial to integrate them carefully to avoid any adverse effects on engagement. To achieve this, learners will have access to tailored AI-generated questions and feedback that will assist them in determining the specific contents they should review. Producing such material manually would be laborious; however, it is anticipated that AI will reduce the human effort involved. The procedure for obtaining a robust system is illustrated in Figure 1 and comprises the following steps:
1. The initial stage of the process involves the implementation of a heuristic, rule-based system. This first iteration is built upon a pipeline capable of automatically generating multiple-choice questions (MCQ-AI), which is explained in further detail in Subsection 2.1. Based on the content of a course, MCQ-AI will generate a series of questions, which will then be presented to the learner. By comparing the learner's answers with the ground truth, basic AI-generated recommendations will be provided for each failed question. As a global evaluation, learners will be presented with an aggregation of the individual recommendations, in addition to a recommendation of the course contents that they should review.
2. The second iteration will attempt to enhance the aforementioned system for automated feedback generation by capitalizing on the human knowledge provided by real instructors through playtesting. In a controlled environment, instructors will be presented with real examples of learner responses, for which they will provide tailored feedback. Subsequently, the human-generated feedback will be used to fine-tune an LLM to produce a global evaluation from the aggregated feedback on failed questions. Furthermore, learners will be provided with recommendations regarding the course contents that they need to review.
3. The third step focuses on text input problems (TIP). As in the previous iterations, the LLM responsible for providing feedback will be enhanced with human knowledge through playtesting. However, it will now also provide feedback for text input problems. In the previous iterations, multiple-choice questions were generated with the assistance of an LLM (MCQ-AI), but this step involves a different LLM, specifically fine-tuned to generate text input problems (TIP-AI). This model will generate question-answer pairs and is explained in greater detail in Subsection 2.2. Embedding representations can then be used to evaluate answers through semantic similarity comparison. Finally, the AI model trained for recommendation generation can be used to provide a global evaluation to learners.
4. The fourth stage of this process involves the generation of mathematical problems using an LLM. LLMs demonstrate robust arithmetic capabilities and logical reasoning, yet they have limited abilities in mathematical and abstract reasoning and fail on graduate-level problems [11], [12]. While these issues present a significant challenge, some examples have shown that neural networks can solve, explain, and generate university-level mathematical problems [13]. A similar approach can be employed to fine-tune a text-pretrained LLM on code. This can be enhanced with general scientific knowledge [14] and well-known structures for solving mathematical problems [15], as well as with prompt engineering techniques such as Chain-of-Thought (CoT), which has demonstrated favorable performance in mathematical problem solving and complex reasoning [16]. The resulting LLM is expected to generate university-level mathematical problems from different fields (e.g., mathematics, physics, chemistry, and programming) with well-explained solutions that can be used to provide recommendations to learners.
5. The final iteration focuses on the problem of Visual Question Generation (VQG), described as the task of using AI models to ask a natural and engaging question about a given image [17]. Following a similar approach, image-to-text models will be used to generate questions for learning environments. One example is the novel LLaVA-1.5, a Large Multimodal Model (LMM). LLaVA-1.5 has demonstrated efficacy in the field of Visual Question Answering [18], and it is anticipated that it will perform similarly well in the Visual Question Generation task. Then, as in previous iterations, the Feedback Generation system will be used to compare learners' answers with the descriptions of the images.

Figure 1: Procedure for creating a system for Question Generation and subsequent automatic correction. Learners will be presented with AI-generated questions. Once they have answered, a customized feedback message will be generated. The models responsible for evaluation and feedback generation will be fine-tuned using real data obtained from playtesting.

The following subsections provide a detailed description of the Multiple-Choice Question generation system (MCQ-AI), the Text Input Problem generation system (TIP-AI), and the Feedback Generation system (FG-AI).

2.1. Multiple-Choice Question generation system (MCQ-AI)
MCQ-AI comprises a series of Deep Learning models, arranged in sequence, which are used to generate multiple-choice questions. Initially, T5 models were employed to solve the problem in English [19], and they were subsequently replaced by mT5 models to implement the pipeline in Spanish [20]. Instruction-based LLMs demonstrate superior performance compared to these earlier model families (e.g., BERT, GPT, T5), even with less complex configurations [6]; their ability to replace the models currently used in the pipeline will therefore be studied. The initial pipeline operates as follows:
1. The course content will be divided into different paragraphs. Initially, these were obtained using sent_tokenize, an NLTK function for splitting text into sentences; however, this approach is not optimal. The issue will be addressed by means of semantic chunking, which entails dividing a text into sentences, comparing the semantic similarity (e.g., cosine similarity) of each sentence with the others, and then grouping sentences with the most similar embeddings together. As a result, meaningful chunks are obtained (a minimal sketch of this chunking step is provided at the end of this subsection).
2. From each paragraph, answers are extracted using a T5 model (T5-AE). Subsequently, a second T5 model (T5-QG) is employed to generate the corresponding questions based on the same paragraph and the related answers. Finally, a third T5 model (T5-DG) is utilized to generate the distractors, i.e., the incorrect answers in multiple-choice questions.
Although the pipeline is effective, instruction-based LLMs have demonstrated superior performance, suggesting that their use will result in a simplification and improvement of the original pipeline. To fine-tune models such as GPT-4 [21] or LLaMA 2 [22], different datasets will be employed, including SQuAD [23], focused on reading comprehension, and HotpotQA [24], to improve reasoning across course contents. In this preliminary study, we were granted access to LLaMA 3 [25]. The efficacy of this LLM can be assessed by examining Figure 2, where a prototype of the MCQ system is presented.

Figure 2: A preliminary prototype of the MCQ system, implemented with LLaMA 3.
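As anticipated in Step 1 above, the sketch below shows one possible implementation of the semantic chunking step, assuming NLTK's sent_tokenize for sentence splitting and a sentence-transformers encoder for the embeddings. The encoder name, the greedy grouping of consecutive sentences, and the similarity threshold are illustrative assumptions rather than settled design choices of MCQ-AI.

```python
# Minimal semantic-chunking sketch. Assumptions: NLTK (with the punkt data) and
# sentence-transformers are installed; the encoder name and the 0.6 threshold are
# placeholder choices, not values fixed by the proposal.
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer, util


def semantic_chunks(text: str, threshold: float = 0.6) -> list[str]:
    """Group consecutive sentences whose embeddings remain semantically similar."""
    sentences = sent_tokenize(text)
    if not sentences:
        return []

    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(sentences, convert_to_tensor=True)

    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        # If the current sentence is close enough to the previous one, extend the
        # current chunk; otherwise a new chunk (paragraph) is started.
        if util.cos_sim(embeddings[i - 1], embeddings[i]).item() >= threshold:
            chunks[-1].append(sentences[i])
        else:
            chunks.append([sentences[i]])

    return [" ".join(chunk) for chunk in chunks]
```

Each returned chunk would then be passed to the answer-extraction and question-generation models of Step 2, or to an instruction-based LLM replacing them.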
2.2. Text Input Problem generation system (TIP-AI)
TIP-AI is similar to the MCQ-AI system described above, but it focuses on Text Input Problem generation. The TIP-AI system is structured as follows:
1. The initial step is analogous to Step 1 of MCQ-AI.
2. An LLM will generate a series of questions related to each semantic chunk. Subsequently, the real answer (the semantic chunk) will be stored as a vector representation (embedding), along with each question generated for the chunk. This results in the creation of different question-answer pairs. Embeddings allow comparisons between the real answer and learners' answers, which are required to determine whether an answer is correct or, alternatively, whether the learner needs further feedback (a minimal sketch of this comparison is provided at the end of this subsection). To achieve this, instruction-based LLMs fine-tuned with datasets such as TQA [26] will be employed. TQA contains a variety of materials suitable for the text input problem generation task, as well as for others, including mathematical and visual question generation tasks.
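The embedding comparison mentioned in Step 2 can be sketched as follows, again assuming a sentence-transformers encoder. The encoder name and the acceptance threshold are placeholders that would need to be tuned on real learner answers before deployment.

```python
# Illustrative sketch of the embedding-based comparison for text input problems.
# Assumptions: sentence-transformers is installed; the encoder name and the 0.75
# acceptance threshold are placeholders, not values fixed by the proposal.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")


def evaluate_text_answer(learner_answer: str, reference_chunk: str,
                         threshold: float = 0.75) -> tuple[bool, float]:
    """Compare the learner's answer with the semantic chunk the question came from."""
    learner_emb = encoder.encode(learner_answer, convert_to_tensor=True)
    reference_emb = encoder.encode(reference_chunk, convert_to_tensor=True)
    similarity = util.cos_sim(learner_emb, reference_emb).item()
    return similarity >= threshold, similarity


# Answers below the threshold, together with their similarity score, would be routed
# to the Feedback Generation system (FG-AI, Subsection 2.3) for tailored feedback.
is_correct, score = evaluate_text_answer(
    "The Transformer replaces recurrence with self-attention.",
    "The Transformer architecture relies entirely on self-attention mechanisms "
    "instead of recurrence to model dependencies between tokens.",
)
```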
2.3. Feedback Generation system (FG-AI)
FG-AI is designed to assist users in their learning process and to guide them in reviewing the course content where they may have performed less effectively. Upon receipt of a learner's answer to a question of the types mentioned above, the following process will be initiated:
1. An automated correction system will evaluate the answer and determine its correctness. This system may be rule-based (e.g., for MCQs), or it may employ a more complex evaluation method, such as semantic similarity (e.g., for TIPs). If the answer is not entirely accurate, an LLM will generate tailored feedback based on the failed question. The model will take into consideration the question, context, correct answer, and distractors, when applicable, as well as the learner's answer. It will then explain why the answer is not entirely correct and provide some concept clarification for the learner. An overview of the various potential evaluation and feedback generation frameworks is given below.
a. The rule-based system is relatively straightforward. Thanks to prompt engineering techniques, a prompt can be developed to generate MCQs with a structured format. This allows the question, the possible answers, and the correct answer to be processed programmatically. The system then displays the question, allowing the learner to interact with it, as demonstrated in the example of Figure 2. The result of the interaction can then be compared to the correct answer. If the response is correct, no further feedback is needed. Otherwise, an LLM will receive the context, the question, the correct answer, and the learner's response, and will generate tailored feedback for the incorrect response (a minimal sketch of this prompt assembly is provided at the end of this subsection). Figure 3 illustrates a preliminary test of this concept.
b. Evaluation and feedback generation for TIPs cannot be addressed in the same manner as for MCQs. In this case, it is not as straightforward to determine whether a response is accurate, since the same idea can be expressed in a variety of ways in natural language. To address the evaluation and feedback generation problem, three possible solutions can be considered:
i. Semantic similarity mechanisms: as previously stated, semantic similarity mechanisms, such as cosine similarity, can be employed to compare a given answer to the ground truth. Based on the model's confidence, an LLM will generate recommendations for the learner.
ii. Template-based recommendations: Swope proposes the creation of templates based on a scale to evaluate learners' answers [27]. Each level of the scale involves examining different features of the learner's answer to ascertain whether they align with the characteristics required by the question. An illustrative example is provided below:
1: The response provided by the learner is not related to the topic at hand.
2: The response provided by the learner is partially related to the topic, but lacks some key aspects (key aspect 1, key aspect 2, …, key aspect n).
3: The response provided by the learner is directly related to the topic and provides a comprehensive overview of the key aspects.
iii. FACTSCORE: FACTSCORE is a novel evaluation method which breaks texts into a series of atomic facts to facilitate comparison [28]. By leveraging the methods described in that work, atomic facts can be generated to evaluate learners' answers and to derive valuable feedback from them.
c. With regard to mathematical problems and questions generated from images, the analysis is still at an initial stage. Consequently, further investigation is required to develop an evaluation framework that will prove to be of value.

Figure 3: The example illustrates the process of generating feedback based on MCQs. In this case, the data from a real test will be collected and transformed into a prompt for the generation of valuable feedback. In other cases, the process is analogous but involves changes in the required data and the manner of its processing, which ultimately yields a prompt suitable for providing feedback. For instance, in the case of TIPs, the data processing step could involve transforming the FACTSCORE output into a template that the LLM can use to validate the result.

2. Upon completion of a learning unit, all the feedback generated for the learner during that specific unit will be gathered and processed by an LLM to identify critical misconceptions. The LLM will then provide a global evaluation and suggest to the learner the content that they need to review. The goal of this process is to increase engagement and improve learners' learning experience, prompting them to keep interacting with the MOOC. To operationalize this concept, instruction-based LLMs will be employed, fine-tuned with existing data from several courses. In addition, other sources will be used, such as data obtained through playtesting with real instructors.
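To make framework (a) more concrete, the sketch below shows how the tailored-feedback prompt for a failed multiple-choice question could be assembled. The prompt wording is illustrative, and call_llm is a hypothetical placeholder for whichever instruction-based model is eventually deployed (e.g., GPT-4 or LLaMA 3); neither reflects the exact configuration used in the prototype of Figure 3.

```python
# Hedged sketch of the per-question feedback step of FG-AI. The prompt text and the
# call_llm placeholder are illustrative assumptions, not the prototype's actual setup.
def build_feedback_prompt(context: str, question: str, correct_answer: str,
                          distractors: list[str], learner_answer: str) -> str:
    """Assemble a prompt asking an LLM to explain a learner's incorrect MCQ answer."""
    return (
        "You are a teaching assistant in an online course.\n\n"
        f"Course excerpt:\n{context}\n\n"
        f"Question: {question}\n"
        f"Correct answer: {correct_answer}\n"
        f"Other options shown: {', '.join(distractors)}\n"
        f"Learner's answer: {learner_answer}\n\n"
        "Briefly explain why the learner's answer is not entirely correct, clarify "
        "the underlying concept, and indicate which part of the excerpt should be "
        "reviewed."
    )


def generate_feedback(call_llm, failed_question: dict) -> str:
    """Generate tailored feedback for one failed question. At the end of a learning
    unit, the individual feedback messages are aggregated and summarized by a second
    LLM call into the global evaluation described in Step 2."""
    return call_llm(build_feedback_prompt(**failed_question))
```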
3. Conclusions
The enhancement of learners' learning processes in MOOCs has been the subject of intense study. Advances in Artificial Intelligence offer the potential for significant improvements, as evidenced by the previous examples. However, there is still much room for improvement. The proposed system is expected to improve on existing results and provide a valuable AI-powered tool for various courses, improving engagement and the learning experience. Nevertheless, due to the central role of Artificial Intelligence in this solution, the deployment of a system similar to the one proposed in this paper requires careful consideration. It is well-documented that artificial intelligence is prone to hallucination, despite the efforts of researchers to avoid this behavior. This implies that false positives and undesired outputs may be generated. In addition, to enhance accessibility and usability, learners must have the option to skip questions generated by this system and to report incorrect recommendations. This will help to ensure that they remain engaged with the learning material, even when they encounter erroneous AI-generated content.

Acknowledgements
The authors gratefully acknowledge the support of the MISTI project "Using language models and chatbots for building virtual assistants in MOOCs".

References
[1] C. R. Glass, M. S. Shiokawa-Baklan, and A. J. Saltarelli, "Who Takes MOOCs?," New Dir. Institutional Res., vol. 2015, no. 167, pp. 41–55, 2016, doi: 10.1002/ir.20153.
[2] M. Perdue, "Incorporating Learner Perspectives into Course Design," in 2023 IEEE Learning with MOOCS (LWMOOCS), Oct. 2023, pp. 1–7. doi: 10.1109/LWMOOCS58322.2023.10306167.
[3] C. Gütl, R. H. Rizzardini, V. Chang, and M. Morales, "Attrition in MOOC: Lessons Learned from Drop-Out Students," in Learning Technology for Education in Cloud. MOOC and Big Data, L. Uden, J. Sinclair, Y.-H. Tao, and D. Liberona, Eds., Cham: Springer International Publishing, 2014, pp. 37–48. doi: 10.1007/978-3-319-10671-7_4.
[4] D. F. O. Onah, J. Sinclair, and R. Boyatt, "Dropout Rates of Massive Open Online Courses: Behavioural Patterns," in EDULEARN14 Proceedings, 6th International Conference on Education and New Learning Technologies, Barcelona, Spain: IATED, Jul. 2014, pp. 5825–5834.
[5] A. Vaswani et al., "Attention Is All You Need." arXiv, Dec. 05, 2017. doi: 10.48550/arXiv.1706.03762.
[6] L. Ouyang et al., "Training language models to follow instructions with human feedback." arXiv, Mar. 04, 2022. doi: 10.48550/arXiv.2203.02155.
[7] A. Nie et al., "The GPT Surprise: Offering Large Language Model Chat in a Massive Coding Class Reduced Engagement but Increased Adopters Exam Performances." OSF, Apr. 25, 2024. doi: 10.31219/osf.io/qy8zd.
[8] W. Morris, S. Crossley, L. Holmes, and A. Trumbore, "Using Transformer Language Models to Validate Peer-Assigned Essay Scores in Massive Open Online Courses (MOOCs)," in LAK23: 13th International Learning Analytics and Knowledge Conference, New York, NY, USA: Association for Computing Machinery, Mar. 2023, pp. 315–323. doi: 10.1145/3576050.3576098.
[9] S. Golchin, N. Garuda, C. Impey, and M. Wenger, "Large Language Models As MOOCs Graders." arXiv, Feb. 29, 2024. doi: 10.48550/arXiv.2402.03776.
[10] W. Dai et al., "Can Large Language Models Provide Feedback to Students? A Case Study on ChatGPT," in 2023 IEEE International Conference on Advanced Learning Technologies (ICALT), Jul. 2023, pp. 323–325. doi: 10.1109/ICALT58122.2023.00100.
[11] Y. Chang et al., "A Survey on Evaluation of Large Language Models," ACM Trans. Intell. Syst. Technol., vol. 15, no. 3, pp. 39:1–39:45, Mar. 2024, doi: 10.1145/3641289.
[12] S. Frieder et al., "Mathematical Capabilities of ChatGPT," Adv. Neural Inf. Process. Syst., vol. 36, pp. 27699–27744, Dec. 2023.
[13] I. Drori et al., "A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level," Proc. Natl. Acad. Sci., vol. 119, no. 32, p. e2123433119, Aug. 2022, doi: 10.1073/pnas.2123433119.
[14] R. Taylor et al., "Galactica: A Large Language Model for Science." arXiv, Nov. 16, 2022. doi: 10.48550/arXiv.2211.09085.
[15] D. Hendrycks et al., "Measuring Mathematical Problem Solving With the MATH Dataset." arXiv, Nov. 08, 2021. doi: 10.48550/arXiv.2103.03874.
[16] J. Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv, Jan. 10, 2023. doi: 10.48550/arXiv.2201.11903.
[17] N. Mostafazadeh, I. Misra, J. Devlin, M. Mitchell, X. He, and L. Vanderwende, "Generating Natural Questions About an Image," arXiv.org. Accessed: May 09, 2024. [Online]. Available: https://arxiv.org/abs/1603.06059v3
[18] H. Liu, C. Li, Y. Li, and Y. J. Lee, "Improved Baselines with Visual Instruction Tuning." arXiv, May 15, 2024. doi: 10.48550/arXiv.2310.03744.
[19] R. Rodriguez-Torrealba, E. Garcia-Lopez, and A. Garcia-Cabot, "End-to-End generation of Multiple-Choice questions using Text-to-Text transfer Transformer models," Expert Syst. Appl., vol. 208, p. 118258, Dec. 2022, doi: 10.1016/j.eswa.2022.118258.
[20] D. De-Fitero-Dominguez, E. Garcia-Lopez, A. Garcia-Cabot, J.-A. Del-Hoyo-Gabaldon, and A. Moreno-Cediel, "Distractor Generation Through Text-to-Text Transformer Models," IEEE Access, vol. 12, pp. 25580–25589, 2024, doi: 10.1109/ACCESS.2024.3361673.
[21] OpenAI, "GPT-4 Technical Report." arXiv, Mar. 27, 2023. doi: 10.48550/arXiv.2303.08774.
[22] H. Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv, Jul. 19, 2023. doi: 10.48550/arXiv.2307.09288.
[23] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ Questions for Machine Comprehension of Text," 2016, doi: 10.48550/arXiv.1606.05250.
[24] Z. Yang et al., "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering." arXiv, Sep. 25, 2018. doi: 10.48550/arXiv.1809.09600.
[25] "Meta Llama 3," Meta Llama. Accessed: May 15, 2024. [Online]. Available: https://llama.meta.com/llama3/
[26] A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi, "Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 5376–5384. doi: 10.1109/CVPR.2017.571.
[27] J. Swope, "2024-05-20 Educators WG: AI Powered Assessment," Open edX Community Wiki. Accessed: Jun. 10, 2024. [Online]. Available: https://openedx.atlassian.net/wiki/spaces/COMM/pages/4246667265/2024-05-20+Educators+WG+AI+Powered+Assessment
[28] S. Min et al., "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation." arXiv, Oct. 11, 2023. doi: 10.48550/arXiv.2305.14251.