<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paulo Shakarian</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abhinav Koyyalamudi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Noel Ngu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lakshmivihari Mareedu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Arizona State University</institution>
          ,
          <addr-line>699 S Mill Ave, Tempe, AZ, 85281</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We study the performance of a commercially available large language model (LLM) known as ChatGPT on math word problems (MWPs) from the dataset DRAW-1K. To our knowledge, this is the first independent evaluation of ChatGPT. We found that ChatGPT's performance changes dramatically based on the requirement to show its work, failing 20% of the time when it provides work compared with 84% when it does not. Further, several factors about MWPs relating to the number of unknowns and the number of operations lead to a higher probability of failure when compared with the prior; in particular, across all experiments, the probability of failure increases linearly with the number of addition and subtraction operations. To support further work on the characterization of LLM performance, we have released a dataset of ChatGPT's responses to the MWPs, and we present baseline machine learning models to predict whether ChatGPT can correctly answer an MWP.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Math Word Problems</kwd>
        <kwd>ChatGPT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The emergence of large language models (LLMs) has attracted much attention in recent years.
At the time of this writing, some consider OpenAI’s GPT 3.5 series models to be the state of the
art [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In particular, a variant tuned for natural dialogue known as ChatGPT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], released in
November 2022 by OpenAI, has gathered much popular interest, gaining over one million users
in a single week [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, in terms of accuracy, LLMs are known to have performance issues,
specifically when reasoning tasks are involved [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ]. This issue, combined with the ubiquity of
such models, has led to work on prompt generation and other aspects of the input [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. Other
areas of machine learning, such as meta-learning [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] and introspection [9, 10] attempt to
predict when a model will succeed or fail for a given input. An introspective tool, especially for
certain tasks, could serve as a front-end to an LLM in a given application.
      </p>
      <p>
        As a step toward such a tool, we investigate aspects of math word problems (MWPs) that
can indicate the success or failure of ChatGPT on such problems. We found that ChatGPT’s
performance changes dramatically based on the requirement to show its work, failing 20% of
the time when it provides work compared with 84% when it does not. Further, several factors
about MWPs can lead to a higher probability of failure when compared with the prior; in
particular, across all experiments, the probability of failure increases linearly with the number
of addition and subtraction operations. We also have released the dataset of ChatGPT’s
responses to the MWPs to support further work on the characterization of LLM performance.
While there has been previous work examining LLM performance on MWPs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], such work
did not investigate specific aspects that increase MWP difficulty, nor did it examine performance
on ChatGPT in particular.
      </p>
      <p>The remainder of this paper proceeds as follows. In Section 2, we describe our methodology.
Then we describe our results in Section 3. Using these intuitions, we present baseline models to
predict the performance of ChatGPT in Section 4. This is followed by a discussion of related
work (Section 5) and future work (Section 6).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>MWP Dataset. In our study, we employed the DRAW-1K dataset [11, 12, 13], which includes
not only 1,000 MWPs with associated answers but also the template algebraic equations that one
would use to solve each word problem. As a running example, consider the following MWP.</p>
      <p>One whole number is three times a second. If 20 is added to the smaller number, the
result is 6 more than the larger.</p>
      <p>We show ChatGPT’s (incorrect) response to this MWP in Figure 1. The DRAW-1K dataset not
only includes the correct answer, which in this case is 21 and 7, but also includes the template
equations used to solve the problem. For our running example, these are the equations
x − y = 20 − 6 and 3 × y − x = 0, where x is the larger number and y the smaller. This information
represents a symbolic representation of the problem which can potentially be used to identify
aspects that make such problems more difficult.</p>
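<p>To sanity-check the template, one can solve this system directly. The following minimal sketch is our own illustration (the names x and y for the larger and smaller numbers are not DRAW-1K notation) and confirms the solution by substitution.</p>

```python
# Template system for the running example:
#   x - y = 20 - 6   (adding 20 to the smaller exceeds the larger by 6)
#   3 * y - x = 0    (one whole number is three times a second)
# Substituting x = 3*y into the first equation gives 2*y = 14.
y = (20 - 6) / 2   # smaller number
x = 3 * y          # larger number
print(x, y)        # prints 21.0 7.0
```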
      <p>Entering Problems into ChatGPT at Scale. At the time of our study, OpenAI, the maker
of ChatGPT, had not released an API. However, using the ChatGPT CLI Python Wrapper by
Mahmoud Mabrouk (https://github.com/mmabrouk/chatgpt-wrapper), we interfaced with
ChatGPT, allowing us to enter the MWPs at scale. For the first two experiments, we added
phrases to the prompt to force ChatGPT to show only the final answer; we developed these
additions based on queries to ChatGPT itself asking it to generate the most appropriate phrase.
However, we found in our third experiment that this addition impacted results. We ran
multiple experiments to test ChatGPT’s ability with these problems.</p>
      <p>• January 2023 Experiment (No work). Our first experiment was run in early January
2023, prior to OpenAI’s announcement of improved performance on mathematical tasks on
January 30, 2023 (https://help.openai.com/en/articles/6825453-chatgpt-release-notes), and in
this experiment we included the following statement as part of the prompt.</p>
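<p>Our collection loop can be sketched as follows. This is a simplified sketch: the ask parameter stands in for the wrapper's query call, which we do not reproduce here, and the suffix shown is an abbreviated form of the January prompt.</p>

```python
# Abbreviated form of the answer-only suffix appended in the no-work experiments.
NO_WORK_SUFFIX = (
    "Don't provide any work/explanation or any extra text. "
    "Just provide the final number of answers for the previous question, "
    "with absolutely no other text."
)

def build_prompt(mwp_text, show_work=False):
    """Append the answer-only suffix unless ChatGPT should show its work."""
    return mwp_text if show_work else mwp_text + " " + NO_WORK_SUFFIX

def collect_responses(problems, ask):
    """Send each MWP through an ask() callable and record the raw reply."""
    return [{"problem": p, "response": ask(build_prompt(p))} for p in problems]
```

<p>In practice, ask would be bound to the wrapper's query method, and the resulting records would be logged for later grading.</p>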
      <sec id="sec-2-1">
        <title>Don’t provide any work/explanation or any extra text. Just provide the final number of answers for the previous question, with absolutely no other text. if there are two or more answers provide them as a comma separated list of numbers.</title>
        <p>• February 2023 Experiment (No work). Our second experiment was run in mid-February
2023, after the aforementioned OpenAI announcement, and also used a prompt that would
cause ChatGPT to show only the answer. However, we found that our original prompt led to
more erratic behavior, so we modified the prompt for this experiment and used the following.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Don’t provide any work/explanation or any extra text. Just provide the final</title>
        <p>number of answers for the previous question, with absolutely no other text.
if there are two or more answers provide them as a comma separated list
of numbers like: ’10, 3,’ etc; or if there is only 1 answer provide it like ’10’.
Absolutely no other text just numbers alone. Just give me the numbers (one or
more) alone. No full stops, no spaces, no words, no slashes, absolutely nothing
extra except the 1 or more numbers you might have gotten as answers.
• February 2023 Experiment (Showing Work). We also repeated the February
experiment without the additional prompt, thereby allowing ChatGPT to show all of its work.
We note that in this experiment we used ChatGPT Plus, which allowed for faster responses.
At the time of this writing, ChatGPT Plus is thought to be only an improvement to
accessibility and not a different model (https://openai.com/blog/chatgpt-plus/).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>The key results of this paper are as follows: (1.) the creation of a dataset consisting of ChatGPT
responses to the MWPs, (2.) identification of ChatGPT failure rates (84% for the January and
February experiments with no work and 20% for the February experiment with work), (3.)
identification of several factors about MWPs relating to the number of unknowns and number of
operations that lead to a higher probability of failure when compared with the prior (Figure 3),
(4.) identification that the probability of failure increases linearly with the number of addition
and subtraction operations (Figure 5), and (5.) identification of a strong linear relationship
between the number of multiplication and division operations and the probability of failure in
the case where ChatGPT shows its work.</p>
      <p>Dataset. We have released ChatGPT’s responses to the 1,000 DRAW-1K MWPs for general use
at https://github.com/lab-v2/ChatGPT_MWP_eval. We believe that researchers studying
this dataset can work to develop models that combine variables, operate directly on the
symbolic template, or even identify aspects of the template from the problem itself in order to
predict LLM performance. We note that, at the time of this writing, collecting data at scale from
ChatGPT is a barrier to such work, as APIs are not currently directly accessible, so this dataset
can facilitate ongoing research without the overhead of data collection.
Overall Performance of ChatGPT on DRAW-1K. As DRAW-1K provides precise and
complete answers for each problem, we classified ChatGPT responses in several different ways;
the percentage of responses in each case is shown in Figure 2.</p>
      <p>1. Returns all answers correctly. Here ChatGPT returned all answers to the MWP (though it
may round sometimes).
2. Returns some answers correctly, but not all values. Here the MWP called for more than one
value, but ChatGPT only returned some of those values.
3. Returns “No Solution.” Here ChatGPT claims there was no solution to the problem. This
was not true for any of the problems.
4. Returns answers, but none are correct. Here ChatGPT returned no correct answers (e.g.,
see Figure 1).</p>
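<p>The four cases above can be scored mechanically. The following sketch is our own simplification (the numeric parsing and the rounding tolerance are illustrative, not the exact grading procedure used for the paper):</p>

```python
import re

def grade(response, gold, tol=0.5):
    """Classify a ChatGPT response against the gold answers.
    Returns one of: 'all_correct', 'partially_correct', 'no_solution', 'incorrect'.
    A returned number matches a gold value when within tol (to allow rounding)."""
    if "no solution" in response.lower():
        return "no_solution"
    returned = [float(tok) for tok in re.findall(r"-?\d+(?:\.\d+)?", response)]
    matched = sum(any(tol >= abs(r - g) for r in returned) for g in gold)
    if matched == len(gold):
        return "all_correct"
    if matched > 0:
        return "partially_correct"
    return "incorrect"

def failure_probability(grades):
    """Cases 3 and 4 (no solution / nothing correct) count as failures."""
    fails = sum(g in ("no_solution", "incorrect") for g in grades)
    return fails / len(grades)
```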
      <p>Throughout this paper, we shall refer to the probability of failure as the probability of cases 3
and 4 above (considered together). In our February experiment, we found that when ChatGPT
omitted work, the percentages reported in Figure 2 remained the same, though they differed
significantly when work was included. We also report actual numbers for all experiments
in Table 1. We note that the probability of failure increases significantly when the work is
not shown. However, when the work is included, ChatGPT obtains performance in line with
state-of-the-art models (i.e., EPT [18, 16]), which have a reported 59% accuracy, while ChatGPT
(when work is shown) has fully correct (or rounded) answers 51% of the time, which can be viewed
as high as 80% if partially correct answers are included.</p>
      <p>Factors Leading to Incorrect Responses. We studied various factors from the templated
solutions provided for the MWPs in the DRAW-1K dataset; these included the number of equations,
the number of unknowns, the number of division and multiplication operations, the number of
addition and subtraction operations, and other variants derived from the metadata in the
DRAW-1K dataset. We identified several factors that, when present, cause ChatGPT to fail with
a probability greater than the prior (when considering the lower bound of a 95% confidence
interval). These results are shown in Figure 3. One interesting aspect we noticed is that when
the system is required to show its work, the number of unknowns present no longer seems to
increase the probability of failure (this was true for all quantities of unknowns, in addition to
what is shown in Figure 3). Additionally, the number of multiplication and division operations,
while increasing the probability of failure beyond the prior in the January experiment, was not
significant (based on 95% confidence intervals) in the February experiment when work was not
shown, possibly as a result of OpenAI’s improvements made at the end of January. However,
there was a significant relationship between the number of multiplication and division operations
and failure when work was shown. In fact, we found a strong linear relationship (R² = 0.802)
in the case where work was shown.</p>
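<p>The comparison against the prior can be reproduced with a standard normal-approximation interval; this is a sketch under the assumption of the usual textbook construction (z = 1.96 for 95%), and the counts in the example are illustrative rather than our measured values.</p>

```python
import math

def failure_ci(failures, total, z=1.96):
    """95% normal-approximation confidence interval for a failure rate."""
    p = failures / total
    half = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half), min(1.0, p + half)

def exceeds_prior(failures, total, prior):
    """Flag a factor when even the interval's lower bound beats the prior."""
    lower, _ = failure_ci(failures, total)
    return lower > prior

# Illustrative: 90 failures out of 100 problems with a given factor, prior 0.84
print(exceeds_prior(90, 100, 0.84))   # prints True
```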
      <p>
        Correlation of failure with additions and subtractions. Previous work has remarked on
the failure of LLMs in multi-step reasoning [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ]. In our study, we identified evidence of
this phenomenon. Specifically, we found a strong linear relationship between the number of
addition and subtraction operations and the probability of failure (R² = 0.821 for the January
experiment, R² = 0.870 for the February experiment, and R² = 0.915 when work was shown).
We show this result in Figure 5. It is noteworthy that the relationship existed in all of our
experiments and seemed to be strengthened when ChatGPT included work in the result.
      </p>
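<p>The linear relationships reported above can be checked with an ordinary least-squares fit. The following self-contained sketch computes the slope, intercept, and R²; the numbers are illustrative, not our measured failure rates.</p>

```python
def linear_fit_r2(xs, ys):
    """Fit y = a*x + b by least squares and return (a, b, R squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return a, b, 1 - ss_res / ss_tot

# Illustrative: failure probability vs. number of additions/subtractions
ops = [0, 1, 2, 3, 4]
fail = [0.15, 0.28, 0.41, 0.55, 0.70]
slope, intercept, r2 = linear_fit_r2(ops, fail)
```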
    </sec>
    <sec id="sec-4">
      <title>4. Performance Prediction Baselines</title>
      <p>The results of the previous section, in particular, the factors indicating a greater probability of
failure (e.g. Figures 3-5), may indicate that the performance of ChatGPT can be predicted. In
this section, we use features obtained from the equations associated with the MWPs to predict
performance. Note that here we use ground-truth equations to derive the features, so the models
presented in this section are essentially using an oracle - we leave extracting such features
from equations returned by ChatGPT or another tool (e.g., EPT [18]) to future work. That said,
as these features deal with counts of operations, unknowns, and equations, a high degree of
accuracy in creating the equations would not be required to faithfully generate such features.</p>
      <p>Following the ideas of machine learning introspection [9, 10], we created performance
prediction models using random forest and XGBoost. We utilized scikit-learn 1.0.2 and XGBoost
1.6.2 respectively. In our experiments, we evaluated each model on each dataset using a five-fold
cross-validation and report average precision and recall in Table 2 (along with F1 computed
based on those averages). In general, our models were able to provide higher precision than
random on predicting incorrect answers for both classifiers. Further, XGBoost was shown to be
able to provide high recall for predicting correct responses. While these results are likely not
suitable for practical use, they do demonstrate that the features extracted provide some amount
of signal to predict performance and provide a baseline for further study.</p>
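<p>Since F1 in Table 2 is computed from the cross-validation averages rather than per fold, the bookkeeping reduces to the following sketch (the fold scores shown are illustrative, not those in Table 2):</p>

```python
def f1_from_averages(precisions, recalls):
    """Average precision and recall over folds, then compute F1 from the averages."""
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Illustrative five-fold scores for one class and one classifier
fold_precision = [0.88, 0.85, 0.90, 0.86, 0.87]
fold_recall = [0.60, 0.58, 0.63, 0.61, 0.59]
score = f1_from_averages(fold_precision, fold_recall)
```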
    </sec>
    <sec id="sec-5">
      <title>5. Related Work</title>
      <p>
        The goal of this challenge dataset is to develop methods to introspect a given MWP in order
to identify how an LLM (in this case ChatGPT) will perform. Recent research in this area has
examined how MWPs can be solved by providing a step-by-step derivation [14, 15, 16, 17]. While
these approaches provide insight into potential errors that can lead to incorrect results, such
introspection has not been the focus of this prior work. Further, the methods of the
aforementioned research are specific to the algorithmic approach. Work resulting from the use of our
challenge dataset could lead to solutions that are agnostic to the underlying MWP solver, as we
treat ChatGPT as a black box. We also note that, if such efforts to introspect MWPs are successful, it would
likely complement a line of work dealing with “chain of thought reasoning” for LLMs [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]
which may inform better ways to generate MWP input for an LLM (e.g., an MWP with fewer
additions may be decomposed into smaller problems). While some of this work also studied
LLM performance on MWPs, it only looked at how various prompting techniques could improve
performance rather than the underlying characteristics of the MWP that lead to degraded
performance of the LLM.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Future Work</title>
      <p>Understanding the performance of commercial black-box LLMs will be an important topic,
as such models will likely become widely used for both commercial and research purposes.
Future directions include an examination of ChatGPT’s performance on MWP datasets other
than DRAW-1K [13], investigating ChatGPT’s nondeterminism, and extending these studies to
upcoming commercial LLMs to be released by companies such as Alphabet and Meta.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Some of the authors have been funded by the ASU Fulton Schools of Engineering.</p>
      <p>[9] S. Daftry, S. Zeng, J. A. Bagnell, M. Hebert, Introspective perception: Learning to predict failures in vision systems. URL: http://arxiv.org/abs/1607.08665. doi:10.48550/arXiv.1607.08665. arXiv:1607.08665 [cs].</p>
      <p>[10] M. S. Ramanagopal, C. Anderson, R. Vasudevan, M. Johnson-Roberson, Failing to learn: Autonomously identifying perception failures for self-driving cars 3, 3860–3867. URL: http://arxiv.org/abs/1707.00051. doi:10.1109/LRA.2018.2857402. arXiv:1707.00051 [cs].</p>
      <p>[11] S. Upadhyay, M.-W. Chang, K.-W. Chang, W.-t. Yih, Learning from explicit and implicit supervision jointly for algebra word problems, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 297–306. URL: https://aclanthology.org/D16-1029. doi:10.18653/v1/D16-1029.</p>
      <p>[12] S. Upadhyay, M.-W. Chang, Annotating derivations: A new evaluation strategy and dataset for algebra word problems. URL: http://arxiv.org/abs/1609.07197. doi:10.48550/arXiv.1609.07197.</p>
      <p>[13] Y. Lan, L. Wang, Q. Zhang, Y. Lan, B. T. Dai, Y. Wang, D. Zhang, E.-P. Lim, MWPToolkit: An open-source framework for deep learning-based math word problem solvers 36, 13188–13190. URL: https://ojs.aaai.org/index.php/AAAI/article/view/21723. doi:10.1609/aaai.v36i11.21723, number: 11.</p>
      <p>[14] Z. Gong, K. Zhou, X. Zhao, J. Sha, S. Wang, J.-R. Wen, Continual pre-training of language models for math problem understanding with syntax-aware memory network, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, pp. 5923–5933. URL: https://aclanthology.org/2022.acl-long.408. doi:10.18653/v1/2022.acl-long.408.</p>
      <p>[15] K. S. Ki, D. Lee, B. Kim, G. Gweon, Generating equation by utilizing operators: GEO model, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, pp. 426–436. URL: https://aclanthology.org/2020.coling-main.38. doi:10.18653/v1/2020.coling-main.38.</p>
      <p>[16] B. Kim, K. S. Ki, S. Rhim, G. Gweon, EPT-X: An expression-pointer transformer model that generates eXplanations for numbers, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, pp. 4442–4458. URL: https://aclanthology.org/2022.acl-long.305. doi:10.18653/v1/2022.acl-long.305.</p>
      <p>[17] Y. Xia, F. Li, Q. Liu, L. Jin, Z. Zhang, X. Sun, L. Shao, ReasonFuse: Reason path driven and global–local fusion network for numerical table-text question answering 516, 169–181. URL: https://www.sciencedirect.com/science/article/pii/S0925231222011444. doi:10.1016/j.neucom.2022.09.046.</p>
      <p>[18] B. Kim, K. S. Ki, D. Lee, G. Gweon, Point to the expression: Solving algebraic word problems using the expression-pointer transformer model, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, pp. 3768–3779. URL: https://aclanthology.org/2020.emnlp-main.308. doi:10.18653/v1/2020.emnlp-main.308.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] How does GPT obtain its ability? Tracing emergent abilities of language models to their sources. URL: https://yaofu.notion.site.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] ChatGPT: Optimizing language models for dialogue. URL: https://openai.com/blog/chatgpt/.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] ChatGPT gained 1 million users in under a week. Here's why the AI chatbot is primed to disrupt search as we know it. URL: https://www.yahoo.com/video/chatgpt-gained-1-million-followers-224523258.html.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. v. d. Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, L. Sifre, Training compute-optimal large language models. URL: http://arxiv.org/abs/2203.15556. arXiv:2203.15556 [cs].</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, D. Zhou, Self-consistency improves chain of thought reasoning in language models. URL: http://arxiv.org/abs/2203.11171. doi:10.48550/arXiv.2203.11171. arXiv:2203.11171 [cs].</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] T. Hospedales, A. Antoniou, P. Micaelli, A. Storkey, Meta-learning in neural networks: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 5149–5169. URL: https://www.computer.org/csdl/journal/tp/2022/09/09428530/1twaJR3AcJW. doi:10.1109/TPAMI.2021.3079209. Publisher: IEEE Computer Society.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] K. Zhou, Z. Liu, Y. Qiao, T. Xiang, C. C. Loy, Domain generalization: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–20. doi:10.1109/TPAMI.2022.3195549.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>