<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Toward Automated Qualitative Analysis: Leveraging Large Language Models for Tutoring Dialogue Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Megan Gu</string-name>
          <email>megangu@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chloe Qianhui Zhao</string-name>
          <email>cqzhao@cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claire Liu</string-name>
          <email>claireli@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikhil Patel</string-name>
          <email>nikhil@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jahnvi Shah</string-name>
          <email>jahnvis@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jionghao Lin</string-name>
          <email>jionghao@hku.hk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kenneth R. Koedinger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Carnegie Mellon University</institution>
          ,
          <addr-line>5000 Forbes Ave, Pittsburgh, PA, 15213</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Monash University</institution>
          ,
          <addr-line>Wellington Rd, Clayton VIC 3800</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The University of Hong Kong</institution>
          ,
          <addr-line>Pokfulam Rd, Hong Kong</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Our study introduces an automated system leveraging large language models (LLMs) to assess the effectiveness of five key tutoring strategies: (1) giving effective praise, (2) reacting to errors, (3) determining what students know, (4) helping students manage inequity, and (5) responding to negative self-talk. Using a public dataset from the Teacher-Student Chatroom Corpus, our system classifies each use of a tutoring strategy as either desired or undesired. Our study uses GPT-3.5 with few-shot prompting to assess the use of these strategies and analyze tutoring dialogues. The results show that, across the five tutoring strategies, True Negative Rates (TNR) range from 0.655 to 0.738 and Recall ranges from 0.327 to 0.432, indicating that the model is effective at excluding incorrect classifications but struggles to consistently identify the correct one. The strategy helping students manage inequity showed the highest performance, with a TNR of 0.738 and a Recall of 0.432. The study highlights the potential of LLMs for tutoring strategy analysis and outlines directions for future improvement, including incorporating more advanced models for more nuanced feedback.</p>
      </abstract>
      <kwd-group>
        <kwd>qualitative analysis</kwd>
        <kwd>large language models</kwd>
        <kwd>dialogue analysis</kwd>
        <kwd>feedback</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Background</title>
      <p>
        Tutoring is widely recognized as one of the most effective forms of personalized learning support [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Within tutoring sessions, strategies such as praising student effort and providing feedback play a
critical role in enhancing student learning outcomes [3, 4]. When effectively employed, these
strategies can support students’ cognitive development, meet their emotional needs, and foster a
positive learning environment. For example, well-placed praise such as “You are making great
progress on this problem” (rather than generic praise like “Good job”) can emphasize the importance
of the learning process, building student resilience and motivation [3]. Understanding how these
tutoring strategies are employed during sessions is crucial, as it highlights whether they align with
desired practices and are delivered in a manner that promotes student growth [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, the
ability to automate this analysis has been constrained by the limitations of earlier natural language
processing (NLP) tools, leaving room for significant improvements. Recent advancements in large
language models (LLMs) offer a promising opportunity to develop automated systems for analyzing
tutoring dialogues. These models (e.g., ChatGPT and Llama), with their ability to process and
understand complex language patterns, provide a promising avenue for evaluating tutoring
strategies in a nuanced and context-aware manner. To analyze the dialogue transcripts, our study
leverages LLMs to develop an automated system (Figure 1), accessible via
https://tutordialogue.vercel.app/dashboard/transcripts.
      </p>
      <p>The system is designed to detect the use of tutoring strategies and assess whether they are
employed in their desired form. It allows users to upload a spreadsheet containing dialogue
transcripts, with each line of dialogue and its corresponding speaker specified. As shown in Figure
1, for each strategy detected, the system determines whether it was used effectively (good) or
ineffectively (bad), and this information is presented in a color-coded format for easy interpretation:
blue indicates effective use (good example), while red indicates ineffective use (bad example).</p>
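      <p>The mapping from classifier output to the color-coded display can be sketched as follows. This is an illustrative sketch only: the spreadsheet column names, the label column, and the helper function are assumptions for demonstration, not details taken from the deployed system.</p>

```python
# Hedged sketch: turning classified transcript rows into color-coded display
# records. Column names ("speaker", "text", "label") are assumed, not the
# system's actual schema.
import csv
import io

# 1 = desired use (good example, shown blue); 0 = undesired use (bad, red)
COLOR = {1: "blue", 0: "red"}

def annotate(rows):
    """Attach a display color to each dialogue line.

    Labels follow the paper's scheme: -1 = strategy not applicable,
    0 = used undesirably, 1 = used in a desired manner.
    Lines labeled -1 receive no highlight (color is None).
    """
    out = []
    for r in rows:
        label = int(r["label"])
        out.append({
            "speaker": r["speaker"],
            "text": r["text"],
            "color": COLOR.get(label),  # None when strategy not applicable
        })
    return out

# Example spreadsheet content (made up for illustration)
sheet = (
    "speaker,text,label\n"
    "tutor,Good job.,0\n"
    "tutor,You explained that step clearly.,1\n"
)
records = annotate(csv.DictReader(io.StringIO(sheet)))
```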
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <sec id="sec-2-1">
        <title>2.1. Data</title>
        <p>Our study used the Teacher-Student Chatroom Corpus [5], a dataset of one-on-one English lessons conducted in an online chatroom. Released in 2022, it contains a total of 262 transcriptions. We hired four annotators to annotate a total of nine transcriptions for the use of five different tutoring strategies. In our annotation scheme, we assigned the following labels to each instance: &lt;-1&gt; when the tutoring strategy was not applicable, &lt;0&gt; when the tutoring strategy was used undesirably, and &lt;1&gt; when the tutoring strategy was used by the tutor in a desired manner.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Prompt Engineering</title>
        <p>We used few-shot chain-of-thought prompting for each of the five tutoring strategies: (1) Giving Effective Praise, (2) Reacting to Errors, (3) Determining What Students Know, (4) Helping Students Manage Inequity, and (5) Responding to Negative Self-Talk. These tutoring strategies, drawn from the PLUS Tutors Platform (https://www.tutors.plus/en/solution/training), generally encourage students to persevere and increase their engagement.</p>
      </sec>
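      <p>A few-shot chain-of-thought prompt of the kind described above can be sketched as follows. The example exchanges, reasoning text, and exact wording are illustrative placeholders, not the study's actual prompts; only the -1/0/1 label scheme comes from the paper.</p>

```python
# Minimal sketch of assembling a few-shot chain-of-thought prompt for one
# tutoring strategy. The demonstration examples below are invented for
# illustration; they are not drawn from the annotated corpus.

FEW_SHOT_EXAMPLES = [
    {
        "dialogue": "Tutor: Good job.",
        "reasoning": "The praise is generic and does not reference the "
                     "student's effort or process.",
        "label": 0,  # strategy used in an undesired manner
    },
    {
        "dialogue": "Tutor: You are making great progress on this problem.",
        "reasoning": "The praise is specific to the student's ongoing effort.",
        "label": 1,  # strategy used in the desired manner
    },
]

def build_prompt(strategy: str, dialogue: str) -> str:
    """Assemble a few-shot chain-of-thought prompt for one strategy.

    Labels follow the annotation scheme: -1 = not applicable,
    0 = used undesirably, 1 = used in a desired manner.
    """
    lines = [
        f"You are evaluating the tutoring strategy: {strategy}.",
        "Answer with -1 (not applicable), 0 (undesired use), or 1 (desired use).",
        "Think step by step before giving the label.",
        "",
    ]
    for ex in FEW_SHOT_EXAMPLES:
        lines += [
            f"Dialogue: {ex['dialogue']}",
            f"Reasoning: {ex['reasoning']}",
            f"Label: {ex['label']}",
            "",
        ]
    # The model continues from "Reasoning:" before emitting its label.
    lines += [f"Dialogue: {dialogue}", "Reasoning:"]
    return "\n".join(lines)

prompt = build_prompt("Giving Effective Praise",
                      "Tutor: Nice work sticking with it!")
```

<p>The completed prompt would then be sent to the model once per strategy per dialogue excerpt, with the returned label parsed from the end of the response.</p>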
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>Our study used the GPT-3.5 model to detect and classify tutoring strategies through few-shot
prompting. Table 1 presents the accuracy of GPT-3.5 in identifying and classifying the five tutoring
strategies, measured by True Negative Rate (TNR) and Recall. GPT-3.5 achieves moderate TNR
(0.655-0.738) but lower Recall (0.327-0.432), suggesting that the model is reasonably effective at
ruling out incorrect classifications but still struggles to identify the correct label from the remaining
two options. The “Helping Students Manage Inequity” strategy achieves the highest performance,
with a TNR of 0.738 and a Recall of 0.432, though overall performance remains limited.</p>
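      <p>For clarity, the two reported metrics can be computed from paired annotator and model labels as sketched below, treating desired use (label 1) as the positive class. The label vectors here are made-up toy data for demonstration, not the study's results.</p>

```python
# Illustrative computation of True Negative Rate (TNR) and Recall for one
# strategy. Positive class = label 1 (desired use); labels -1 and 0 count
# as negative. The example labels are invented, not the study's data.

def tnr_and_recall(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tnr = tn / (tn + fp) if (tn + fp) else 0.0       # specificity
    recall = tp / (tp + fn) if (tp + fn) else 0.0    # sensitivity
    return tnr, recall

# Toy annotator labels vs. model predictions (-1 / 0 / 1)
y_true = [1, 0, -1, 1, 0, -1, 1, 0]
y_pred = [1, 0, 0, -1, 0, -1, 1, 1]
tnr, recall = tnr_and_recall(y_true, y_pred)
```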
      <p>Further enhancements to our transcription analysis system will focus on incorporating more
advanced LLMs, providing detailed statistics and feedback based on the classification results,
reporting how often each strategy was used effectively or ineffectively, and generating overall
feedback from the model. This feedback will evaluate the tutor’s effectiveness in employing each
strategy and offer suggestions for improvement.</p>
      <sec id="sec-3-1">
        <title>Acknowledgements</title>
        <p>This research was funded by the Richard King Mellon Foundation (Grant #10851) and the Learning
Engineering Virtual Institute (https://learning-engineering-virtual-institute.org/). The opinions,
findings, and conclusions expressed in this paper are those of the authors alone.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the authors used GPT-3.5 to assist with the classification of tutoring
strategies in dialogue data via few-shot prompting. The authors used GPT-4 for grammar and
spelling checks. After using GPT-4, the authors reviewed and edited the content as needed and take
full responsibility for the final publication.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] A. Nickow, P. Oreopoulos, V. Quan, The Impressive Effects of Tutoring on PreK-12 Learning: A Systematic Review and Meta-Analysis of the Experimental Evidence, Working Paper 27476, National Bureau of Economic Research, 2020. doi:10.3386/w27476.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Z. F. Han, J. Lin, A. Gurung, D. R. Thomas, E. Chen, C. Borchers, S. Gupta, K. R. Koedinger, Improving assessment of tutoring practices using retrieval-augmented generation, 2024. URL: https://arxiv.org/abs/2402.14594. arXiv:2402.14594.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] D. J. Royer, K. L. Lane, K. D. Dunlap, R. P. Ennis, A systematic review of teacher-delivered behavior-specific praise on K–12 student performance, Remedial and Special Education 40 (2019) 112–128. doi:10.1177/0741932517751054.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] J. Lin, S. Singh, L. Sha, W. Tan, D. Lang, D. Gašević, G. Chen, Is it a good move? Mining effective tutoring strategies from human–human tutorial dialogues, Future Generation Computer Systems 127 (2022) 194–207. doi:10.1016/j.future.2021.09.001.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. Caines, H. Yannakoudakis, H. Allen, P. Pérez-Paredes, B. Byrne, P. Buttery, The Teacher-Student Chatroom Corpus version 2: more lessons, new annotation, automatic detection of sequence shifts, in: D. Alfter, E. Volodina, T. François, P. Desmet, F. Cornillie, A. Jönsson, E. Rennes (Eds.), Proceedings of the 11th Workshop on NLP for Computer Assisted Language Learning, LiU Electronic Press, Louvain-la-Neuve, Belgium, 2022, pp. 23–35. URL: https://aclanthology.org/2022.nlp4call-1.3/.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>