<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>GPT-3.5, GPT-4, Bard, and Claude's Performance on the Chinese Reading Comprehension Test</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bor-Chen Kuo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pei-Chen Wu</string-name>
          <email>pedropcwu@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chen-Huei Liao</string-name>
          <email>chenhueiliao@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Taichung University of Education</institution>
          ,
          <addr-line>No.140, Minsheng Rd., West Dist., Taichung City 403514</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>In this study, we explored the performance of advanced generative AI models (GPT-3.5, GPT-4, Bard, and Claude) on Chinese reading comprehension tasks. Using a fifth-grade Chinese reading comprehension test comprising 55 questions, we assessed the performance of these models in comparison with 491 fifth-grade students from Central Taiwan. The results showed that GPT-4 performed best on the test, and that using level settings was more effective than not using them. Analysis of the level settings indicated noticeable differences between Levels 1 and 2 for GPT and Bard, with less distinct variation between Levels 2 and 3. In contrast, Claude exhibited minimal variation in results across all levels. The performance of the human students was similar to that of GPT-3.5, but not as high as that of the other models. For future research, we recommend a more nuanced prompt design to better simulate the reading comprehension abilities of students of various ages, thereby further enhancing the educational applications of these models.</p>
      </abstract>
      <kwd-group>
        <kwd>large language models</kwd>
        <kwd>reading comprehension</kwd>
        <kwd>pass rate</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
    </sec>
    <sec id="sec-2">
      <title>2. METHODS</title>
      <p>
        In this study, we employed the fifth-grade Chinese reading comprehension test developed by Prof.
Chen-Huei Liao's team at National Taichung University of Education [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] as a test tool. This test was
used to evaluate the performance of various language models – GPT-3.5, GPT-4, Bard, and Claude –
in Chinese reading comprehension. Our goal was to determine how effectively these models simulate
reading comprehension across different levels and to compare their pass rates with those of human
students.
      </p>
      <p>The test consists of 55 questions, characterized by an average difficulty of 0.614, a discrimination
of 0.39, and a reliability of 0.899. It includes four question types: word and phrase, sentence, contextual
comprehension, and inference, covering six dimensions: phonological processing ability, vocabulary
comprehension, sentence comprehension, grammatical comprehension, contextual comprehension, and
inferential comprehension. The format is a four-option multiple-choice test.</p>
      <p>According to the research objectives, the following tasks were carried out in this study:
1. T1: Evaluate the effects and performance of GPT, Bard, and Claude on the Chinese reading
comprehension test with and without level settings.</p>
      <p>2. T2: Compare the performance of GPT, Bard, Claude, and human students on the Chinese reading
comprehension test.</p>
      <p>2.1 T1 TEST
The purpose of this test was to address Research Questions 1 and 2 (RQ1 and RQ2), specifically to
evaluate the effects and response results of each model both with and without level settings. The aim
was to ascertain whether the models could effectively simulate reading comprehension test
performance for students at different levels. In this study, the levels were defined to represent various
age groups: Level 1 for grades 1 to 3, Level 2 for grades 4 to 6, and Level 3 for grades 7 to 9. The
initial test was conducted without a level setting. The same prompt was input into all four models,
set as follows: 'You are now asked to do a reading comprehension test, please solve
the question, there are 55 questions in total, and they will be provided in batches.'</p>
      <p>We discovered that the model's effectiveness in answering the questions diminished when it was
given all 55 questions at once. The slower response speed could be attributed to the challenge of
processing a large amount of text simultaneously, which appeared to decrease its parsing ability and
increase the error rate in question-solving. Consequently, we decided to present 10-15 questions at a
time to the model and then calculated the pass rate by comparing the selected answers with the correct
ones.</p>
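      <p>The batching and scoring procedure just described can be sketched as follows. This is a minimal illustration, not the authors' code: ask_model stands in for whichever chat interface is queried, and the batch size of 12 is one value in the 10-15 range used.</p>
      <preformat>
```python
from typing import Callable, List, Sequence

def batches(items: Sequence, size: int):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def run_test(ask_model: Callable[[List[str]], List[str]],
             questions: List[str],
             answer_key: List[str],
             batch_size: int = 12) -> float:
    """Present questions in small batches and return the overall pass rate."""
    picked: List[str] = []
    for batch in batches(questions, batch_size):
        # One model call per batch, returning one chosen option per question.
        picked.extend(ask_model(batch))
    n_correct = sum(a == b for a, b in zip(picked, answer_key))
    return n_correct / len(answer_key)
```
      </preformat>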
      <p>In the next phase of testing, which included level settings, all four models were given the same
prompt, intending to have each model simulate the reading comprehension level of students of
different grades. Taking Level 2 as an example, the content of the prompt was: 'You are now a Grade
4 - 6 student, and you are now asked to do a reading comprehension test based on the reading
comprehension skills you should have at your current level. There are 55 questions in total,
and they will be provided in batches.' The outcome was consistent with the previous phase: when a
model was tasked with answering all 55 questions at once, its effectiveness decreased, as lower
parsing ability when reading large texts at once could lead to a higher error rate in
solving the questions. Moreover, when simulating students of different grades, the results were nearly
identical for students in grade 4 and above, making it challenging to distinguish between the reading
comprehension abilities of students in different grades. Ultimately, we again opted to provide each
model with 10-15 questions at a time, recording the response options and the correct answers to
calculate the pass rate.</p>
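      <p>The two prompt conditions differ only in the level preamble. A small helper reproducing the quoted prompts (grade bands per the Level 1-3 definitions above; the function name is ours) might look like:</p>
      <preformat>
```python
from typing import Optional

# Grade bands for each level, as defined in this study.
GRADE_BANDS = {1: "Grade 1 - 3", 2: "Grade 4 - 6", 3: "Grade 7 - 9"}

def build_prompt(level: Optional[int] = None) -> str:
    """Return the test prompt, with or without a level setting."""
    if level is None:
        return ("You are now asked to do a reading comprehension test, "
                "please solve the question, there are 55 questions in total, "
                "and they will be provided in batches.")
    return (f"You are now a {GRADE_BANDS[level]} student, and you are now "
            "asked to do a reading comprehension test based on the reading "
            "comprehension skills you should have at your current level. "
            "There are 55 questions in total, and they will be provided in "
            "batches.")
```
      </preformat>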
      <p>
        2.2 T2 TEST
The objective of this test was to address Research Question 3 (RQ3), which aimed to compare the
performance of the models with that of human students on a Chinese reading comprehension test. The
models' response data were sourced from the T1 TEST. The human students' response data utilized
in this study were obtained from Lin [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which involved the participation of 491 fifth-grade students
in Central Taiwan. This assessment was conducted in a paper-based format. After the testing, the
students' responses were digitized and subjected to a detailed analysis using
BILOG-MG, and the average pass rate among the students was calculated from the results
of this analysis.
      </p>
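      <p>The source does not detail the BILOG-MG scoring steps; as a hedged sketch of the final statistic only, the group-level average pass rate can be taken as the mean of per-student proportions correct:</p>
      <preformat>
```python
import numpy as np

def average_pass_rate(responses: np.ndarray) -> float:
    """Mean pass rate over students.

    responses: binary matrix (n_students, n_items); 1 = correct.
    Each student's pass rate is their proportion of correct answers;
    the group statistic is the mean of those proportions.
    """
    return float(responses.mean(axis=1).mean())
```
      </preformat>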
    </sec>
    <sec id="sec-3">
      <title>3. RESULTS</title>
      <p>The results demonstrated that all four models exhibited improved performance with the level setting
compared to without it. GPT-4 emerged as the top performer, followed by Claude, then Bard, and finally
GPT-3.5, as illustrated in Table 1.
Note. With level setting (Y) indicates the average pass rate across the levels.</p>
      <p>According to Table 2, when GPT, Bard, and Claude were given the same prompt, the pass rates for GPT
and Bard exhibited notable variation across levels, particularly between Level 1 and Level 2. In
contrast, Claude showed negligible variation (only a 1.82% difference between Level 1 and Level 2).
During the testing phase, the Claude model indicated that it cannot fully replicate the cognitive and
problem-solving abilities of students of a specific age group, but that it can attempt to solve problems
by employing basic vocabulary and knowledge suitable for that age group, complemented by relevant
assumptions and inferences. The final outcomes align with this self-description.</p>
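      <p>The level-to-level comparisons above (e.g. Claude's 1.82% gap between Levels 1 and 2) reduce to differences between consecutive levels' pass rates; a small helper for that tabulation (all names illustrative) could be:</p>
      <preformat>
```python
from typing import Dict, List

def level_deltas(pass_rates: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Differences between consecutive levels' pass rates, per model.

    pass_rates maps a model name to its [Level 1, Level 2, Level 3] rates.
    """
    return {
        model: [round(rates[i + 1] - rates[i], 4)
                for i in range(len(rates) - 1)]
        for model, rates in pass_rates.items()
    }
```
      </preformat>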
    </sec>
    <sec id="sec-4">
      <title>4. CONCLUSIONS</title>
      <p>The results of the study showed that GPT-4 performed the best on the test, and that using level
settings was more effective than not using them. The analysis of the level settings revealed a more
pronounced difference between Level 1 and Level 2 for GPT and Bard, whereas the difference between
Level 2 and Level 3 was less marked. The performance of Claude in Levels 1, 2, and 3 was similar,
suggesting that Claude was less adept in this capacity. The performance of the human students was
similar to that of GPT-3.5, but not as good as that of the other models.</p>
      <p>For future enhancements, in addition to fine-tuning the models, we can consider specifying the
reading comprehension abilities expected of students in different age groups when providing the
prompt. This strategy could more accurately align the models with the actual thinking and
problem-solving patterns of students across various age groups during simulation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , vol.
          <volume>33</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          . doi:10.48550/arXiv.2005.14165.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kasneci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Seßler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Küchemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bannert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fischer</surname>
          </string-name>
          , et al.,
          <article-title>ChatGPT for good? On opportunities and challenges of large language models for education</article-title>
          ,
          <source>Learning and Individual Differences</source>
          <volume>103</volume>
          (
          <year>2023</year>
          )
          <fpage>102274</fpage>
          . doi:10.1016/j.lindif.2023.102274.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Establishment of the computerized adaptive reading comprehension test for fifth grade students in elementary school</article-title>
          ,
          <source>Master's thesis</source>
          , National Taichung University of Education,
          <year>2014</year>
          . URL: https://hdl.handle.net/11296/z2xa8e.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>