<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Toward Automated Qualitative Analysis: Leveraging Large Language Models for Tutoring Dialogue Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Megan Gu</string-name>
          <email>megangu@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chloe Qianhui Zhao</string-name>
          <email>cqzhao@cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claire Liu</string-name>
          <email>claireli@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikhil Patel</string-name>
          <email>nikhil@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jahnvi Shah</string-name>
          <email>jahnvis@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jionghao Lin</string-name>
          <email>jionghao@hku.hk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kenneth R. Koedinger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Carnegie Mellon University</institution>
          ,
          <addr-line>5000 Forbes Ave, Pittsburgh, PA, 15213</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Monash University</institution>
          ,
          <addr-line>Wellington Rd, Clayton VIC 3800</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The University of Hong Kong</institution>
          ,
          <addr-line>Pokfulam Rd, Hong Kong</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Our study introduces an automated system leveraging large language models (LLMs) to assess the effectiveness of five key tutoring strategies: (1) giving effective praise, (2) reacting to errors, (3) determining what students know, (4) helping students manage inequity, and (5) responding to negative self-talk. Using a public dataset from the Teacher-Student Chatroom Corpus, our system classifies each use of a tutoring strategy as either desired or undesired. Our study uses GPT-3.5 with few-shot prompting to assess the use of these strategies and analyze tutoring dialogues. The results show that, across the five tutoring strategies, True Negative Rates (TNR) range from 0.655 to 0.738 and Recall ranges from 0.327 to 0.432, indicating that the model is effective at excluding incorrect classifications but struggles to consistently identify the correct one. The strategy helping students manage inequity showed the highest performance, with a TNR of 0.738 and a Recall of 0.432. The study highlights the potential of LLMs for tutoring strategy analysis and outlines directions for future improvement, including incorporating more advanced models for more nuanced feedback.</p>
      </abstract>
      <kwd-group>
        <kwd>qualitative analysis</kwd>
        <kwd>large language models</kwd>
        <kwd>dialogue analysis</kwd>
        <kwd>feedback</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Background</title>
      <p>
        Tutoring is widely recognized as one of the most effective forms of personalized learning support [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Within tutoring sessions, strategies such as praising student effort and providing feedback play a
critical role in enhancing student learning outcomes [3, 4]. When effectively employed, these
strategies can support students’ cognitive development, meet their emotional needs, and foster a
positive learning environment. For example, well-placed praise such as “You are making great
progress on this problem” (rather than generic praise like “Good job”) can emphasize the importance
of the learning process, building student resilience and motivation [3]. Understanding how these
tutoring strategies are employed during sessions is crucial, as it highlights whether they align with
desired practices and are delivered in a manner that promotes student growth [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, the
ability to automate this analysis has been constrained by the limitations of earlier natural language
processing (NLP) tools, leaving room for significant improvements. Recent advancements in large
language models (LLMs) offer a promising opportunity to develop automated systems for analyzing
tutoring dialogues. These models (e.g., ChatGPT and Llama), with their ability to process and
understand complex language patterns, provide a promising avenue for evaluating tutoring
strategies in a nuanced and context-aware manner. To analyze the dialogue transcripts, our study
leverages LLMs to develop an automated system (Figure 1), accessible via
https://tutordialogue.vercel.app/dashboard/transcripts.
      </p>
      <p>The system is designed to detect the use of tutoring strategies and assess whether they are
employed in their desired form. It allows users to upload a spreadsheet containing dialogue
transcripts, with each line of dialogue and its corresponding speaker specified. As shown in Figure
1, for each strategy detected, the system determines whether it was used effectively (good) or
ineffectively (bad), and this information is presented in a color-coded format for easy interpretation:
blue indicates effective use (good example), while red indicates ineffective use (bad example).</p>
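      <p>The mapping from classifier output to the color-coded display can be sketched as follows. This is an illustrative sketch only: the spreadsheet column names, the label column, and the helper function are assumptions for demonstration, not details taken from the deployed system.</p>

```python
# Hedged sketch: turning classified transcript rows into color-coded display
# records. Column names ("speaker", "text", "label") are assumed, not the
# system's actual schema.
import csv
import io

# 1 = desired use (good example, shown blue); 0 = undesired use (bad, red)
COLOR = {1: "blue", 0: "red"}

def annotate(rows):
    """Attach a display color to each dialogue line.

    Labels follow the paper's scheme: -1 = strategy not applicable,
    0 = used undesirably, 1 = used in a desired manner.
    Lines labeled -1 receive no highlight (color is None).
    """
    out = []
    for r in rows:
        label = int(r["label"])
        out.append({
            "speaker": r["speaker"],
            "text": r["text"],
            "color": COLOR.get(label),  # None when strategy not applicable
        })
    return out

# Example spreadsheet content (made up for illustration)
sheet = (
    "speaker,text,label\n"
    "tutor,Good job.,0\n"
    "tutor,You explained that step clearly.,1\n"
)
records = annotate(csv.DictReader(io.StringIO(sheet)))
```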
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <sec id="sec-2-1">
        <title>2.1. Data</title>
        <p>Our study used the Teacher-Student Chatroom Corpus [5], a dataset of one-on-one English lessons conducted in an online chatroom. Released in 2022, it contains a total of 262 transcriptions. We hired four annotators to annotate a total of nine transcriptions for the use of five different tutoring strategies. In our annotation scheme, we assigned the following labels to each instance: &lt;-1&gt; when the tutoring strategy was not applicable, &lt;0&gt; when the tutoring strategy was used undesirably, and &lt;1&gt; when the tutoring strategy was used by the tutor in a desired manner.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Prompt Engineering</title>
        <p>We used few-shot chain-of-thought prompting for each of the five tutoring strategies: (1) Giving Effective Praise, (2) Reacting to Errors, (3) Determining What Students Know, (4) Helping Students Manage Inequity, and (5) Responding to Negative Self-Talk. These tutoring strategies, drawn from the PLUS Tutors Platform (https://www.tutors.plus/en/solution/training), generally encourage students to persevere and increase their engagement.</p>
      </sec>
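      <p>A few-shot chain-of-thought prompt of the kind described above can be sketched as follows. The example exchanges, reasoning text, and exact wording are illustrative placeholders, not the study's actual prompts; only the -1/0/1 label scheme comes from the paper.</p>

```python
# Minimal sketch of assembling a few-shot chain-of-thought prompt for one
# tutoring strategy. The demonstration examples below are invented for
# illustration; they are not drawn from the annotated corpus.

FEW_SHOT_EXAMPLES = [
    {
        "dialogue": "Tutor: Good job.",
        "reasoning": "The praise is generic and does not reference the "
                     "student's effort or process.",
        "label": 0,  # strategy used in an undesired manner
    },
    {
        "dialogue": "Tutor: You are making great progress on this problem.",
        "reasoning": "The praise is specific to the student's ongoing effort.",
        "label": 1,  # strategy used in the desired manner
    },
]

def build_prompt(strategy: str, dialogue: str) -> str:
    """Assemble a few-shot chain-of-thought prompt for one strategy.

    Labels follow the annotation scheme: -1 = not applicable,
    0 = used undesirably, 1 = used in a desired manner.
    """
    lines = [
        f"You are evaluating the tutoring strategy: {strategy}.",
        "Answer with -1 (not applicable), 0 (undesired use), or 1 (desired use).",
        "Think step by step before giving the label.",
        "",
    ]
    for ex in FEW_SHOT_EXAMPLES:
        lines += [
            f"Dialogue: {ex['dialogue']}",
            f"Reasoning: {ex['reasoning']}",
            f"Label: {ex['label']}",
            "",
        ]
    # The model continues from "Reasoning:" before emitting its label.
    lines += [f"Dialogue: {dialogue}", "Reasoning:"]
    return "\n".join(lines)

prompt = build_prompt("Giving Effective Praise",
                      "Tutor: Nice work sticking with it!")
```

<p>The completed prompt would then be sent to the model once per strategy per dialogue excerpt, with the returned label parsed from the end of the response.</p>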
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>Our study used the GPT-3.5 model to detect and classify tutoring strategies through few-shot
prompting. Table 1 presents the accuracy of GPT-3.5 in identifying and classifying the five tutoring
strategies, measured by True Negative Rate (TNR) and Recall. GPT-3.5 achieves moderate TNR
(0.655-0.738) but lower Recall (0.327-0.432), suggesting that the model is reasonably effective at
ruling out incorrect classifications but still struggles to identify the correct label from the remaining
two options. The “Helping Students Manage Inequity” strategy achieves the highest performance,
with a TNR of 0.738 and a Recall of 0.432, though overall performance remains limited.</p>
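      <p>For clarity, the two reported metrics can be computed from paired annotator and model labels as sketched below, treating desired use (label 1) as the positive class. The label vectors here are made-up toy data for demonstration, not the study's results.</p>

```python
# Illustrative computation of True Negative Rate (TNR) and Recall for one
# strategy. Positive class = label 1 (desired use); labels -1 and 0 count
# as negative. The example labels are invented, not the study's data.

def tnr_and_recall(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tnr = tn / (tn + fp) if (tn + fp) else 0.0       # specificity
    recall = tp / (tp + fn) if (tp + fn) else 0.0    # sensitivity
    return tnr, recall

# Toy annotator labels vs. model predictions (-1 / 0 / 1)
y_true = [1, 0, -1, 1, 0, -1, 1, 0]
y_pred = [1, 0, 0, -1, 0, -1, 1, 1]
tnr, recall = tnr_and_recall(y_true, y_pred)
```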
      <p>Further enhancements to our transcription analysis system will focus on incorporating more
advanced LLMs, providing detailed statistics and feedback based on the classification results,
reporting how often each strategy was used effectively or ineffectively, and generating overall
feedback from the model. This feedback will evaluate the tutor’s effectiveness in employing each
strategy and offer suggestions for improvement.</p>
      <sec id="sec-3-1">
        <title>Acknowledgements</title>
        <p>This research was funded by the Richard King Mellon Foundation (Grant #10851) and the Learning
Engineering Virtual Institute (https://learning-engineering-virtual-institute.org/). The opinions,
findings, and conclusions expressed in this paper are those of the authors alone.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the authors used GPT-3.5 to assist with the classification of tutoring
strategies in dialogue data via few-shot prompting. The authors used GPT-4 for grammar and
spelling checks. After using GPT-4, the authors reviewed and edited the content as needed and take
full responsibility for the final publication.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] A. Nickow, P. Oreopoulos, V. Quan, The Impressive Effects of Tutoring on PreK-12 Learning: A Systematic Review and Meta-Analysis of the Experimental Evidence, Working Paper 27476, National Bureau of Economic Research, 2020. doi:10.3386/w27476.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Z. F. Han, J. Lin, A. Gurung, D. R. Thomas, E. Chen, C. Borchers, S. Gupta, K. R. Koedinger, Improving assessment of tutoring practices using retrieval-augmented generation, 2024. URL: https://arxiv.org/abs/2402.14594. arXiv:2402.14594.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] D. J. Royer, K. L. Lane, K. D. Dunlap, R. P. Ennis, A systematic review of teacher-delivered behavior-specific praise on K–12 student performance, Remedial and Special Education 40 (2019) 112–128. doi:10.1177/0741932517751054.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] J. Lin, S. Singh, L. Sha, W. Tan, D. Lang, D. Gašević, G. Chen, Is it a good move? Mining effective tutoring strategies from human–human tutorial dialogues, Future Generation Computer Systems 127 (2022) 194–207. doi:10.1016/j.future.2021.09.001.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. Caines, H. Yannakoudakis, H. Allen, P. Pérez-Paredes, B. Byrne, P. Buttery, The Teacher-Student Chatroom Corpus version 2: more lessons, new annotation, automatic detection of sequence shifts, in: D. Alfter, E. Volodina, T. François, P. Desmet, F. Cornillie, A. Jönsson, E. Rennes (Eds.), Proceedings of the 11th Workshop on NLP for Computer Assisted Language Learning, LiU Electronic Press, Louvain-la-Neuve, Belgium, 2022, pp. 23–35. URL: https://aclanthology.org/2022.nlp4call-1.3/.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>