=Paper=
{{Paper
|id=Vol-3781/paper03
|storemode=property
|title=Advantages and limitations of large language models in chemistry education: A comparative analysis of ChatGPT, Gemini and Copilot
|pdfUrl=https://ceur-ws.org/Vol-3781/paper03.pdf
|volume=Vol-3781
|authors=Yuliia V. Kharchenko,Olena M. Babenko
|dblpUrl=https://dblp.org/rec/conf/3lperson/KharchenkoB24
}}
==Advantages and limitations of large language models in chemistry education: A comparative analysis of ChatGPT, Gemini and Copilot==
Yuliia V. Kharchenko et al. CEUR Workshop Proceedings 42–59
Advantages and limitations of large language models in
chemistry education: A comparative analysis of ChatGPT,
Gemini and Copilot
Yuliia V. Kharchenko, Olena M. Babenko
Sumy State Pedagogical University named after A. S. Makarenko, 87 Romenska Str., Sumy, 40002, Ukraine
Abstract
This study aims to explore the potential and limitations of large language models (LLMs) such as ChatGPT,
Gemini, and Copilot, in the context of chemistry education. The primary objective of the study is to compare
the effectiveness of LLMs in solving chemistry tasks and to identify the key challenges associated with their
implementation in education. These LLMs were selected based on a survey of students which indicated their
widespread use due to their free accessibility. To evaluate the potential of LLMs in chemistry education, we
employed them to solve tasks corresponding to different levels of knowledge in different subfields of chemistry.
A comparative evaluation of LLMs’ performance against that of average Ukrainian students was conducted.
The results indicate that while LLMs show promise mainly in tasks not demanding deep logical reasoning,
they are generally inferior to students. Key challenges in using LLMs in chemistry education identified include
understanding the nuances of chemistry as a complex and multifaceted science, abstract concepts used in
chemistry, recognition of chemical compound formulas, chemical reaction equations, limitations in logical
reasoning, language barriers, and the occurrence of AI hallucinations. Additionally, there is a need for students
to develop skills in crafting effective queries and prompts to enhance the efficiency of working with LLM. While
LLMs are promising, their implementation requires addressing the identified limitations.
Keywords
artificial intelligence, LLM, ChatGPT, Gemini, Copilot, chemistry education
1. Introduction
A characteristic feature of modern times is the continuous dynamics of change in all spheres of human
life: economy, politics, science, and education. The dominant trend in the development of contemporary
civilization is its transition into an information society, wherein information and communication
technologies become the objects of human activity, providing the conditions necessary for the
formation and development of a new type of personality. The rapid development of the
global Internet network has led to a computer revolution in the information world, where the computer
serves as the primary means of telecommunication. Considering that the current stage of development
of pedagogical science in the world is characterized by an intensive search for new ways to improve the
quality of education, information and communication technologies have become powerful tools in this
process [1]. Various forms of information and communication technologies (ICTs) have found active
application in education, ranging from electronic textbooks [2], online learning technologies [3, 4, 5],
mobile applications [6, 7] to augmented and virtual reality technologies [8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23]. The latest term in the digital revolution is artificial intelligence (AI) [24, 25],
including generative AI. Generative artificial intelligence (GenAI) is a technology that automatically
generates content in response to queries. GenAI creates new content by drawing on existing content. Its
output may take any format that symbolically represents human thought: texts, images
(including drawings, photographs, videos and animations), music and software code. GenAI learns from
data it collects from web pages, conversations in social networks, and other online media. It generates
its content by statistically analyzing the order of words, pixels, or other elements in the data it has
learned, and identifying and replicating common patterns (such as which words typically follow other
words, and in what order).
3L-Person 2024: IX International Workshop on Professional Retraining and Life-Long Learning using ICT: Person-oriented Approach,
co-located with the 19th International Conference on ICT in Education, Research, and Industrial Applications (ICTERI 2024)
September 23, 2024, Lviv, Ukraine
Email: yuvlakhar@gmail.com (Y. V. Kharchenko); olena.babenko@sspu.edu.ua (O. M. Babenko)
URL: https://scholar.google.com.ua/citations?user=zYiU4iMAAAAJ (Y. V. Kharchenko);
https://scholar.google.com.ua/citations?user=AeYIdfAAAAAJ (O. M. Babenko)
ORCID: 0000-0002-8960-2440 (Y. V. Kharchenko); 0000-0002-1416-2700 (O. M. Babenko)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
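The idea of replicating "which words typically follow other words" can be illustrated with a toy bigram model. This is only a minimal sketch: the corpus and the function names `train_bigrams` and `generate` are invented for illustration, and real LLMs use transformer networks over far larger contexts rather than simple follower counts.

```python
import random
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers

def generate(followers, start, length=8, seed=0):
    """Generate text by repeatedly sampling a follower in proportion to its count."""
    rng = random.Random(seed)
    word, output = start, [start]
    for _ in range(length - 1):
        options = followers.get(word)
        if not options:  # dead end: no word ever followed this one
            break
        word = rng.choices(list(options), weights=list(options.values()))[0]
        output.append(word)
    return " ".join(output)

# Toy corpus, invented for illustration only (not data from the study).
corpus = "acids react with bases . bases react with acids . acids donate protons ."
model = train_bigrams(corpus)
print(model["react"].most_common(1))  # [('with', 2)]
print(generate(model, "acids"))
```

Because the model only replays observed word orders, it can produce fluent-looking sentences without any notion of whether they are chemically true.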
Text-based generative artificial intelligence, utilizing a type of artificial neural network known as
a general-purpose transformer, is particularly popular. This type of AI, often referred to as Large
Language Models (LLMs), is commonly known as a generative pre-trained transformer, or GPT.
GPTs and their ability to automatically generate text became available to the global research community
in 2018. The launch of ChatGPT in 2022, which offered free access and a user-friendly interface,
became a sensation and prompted other companies to develop and launch similar systems.
By mid-2023, alternatives to ChatGPT [26] had become available,
most of which were free (within certain limits), as well as services for generating images, videos, and
music, some of which are listed in table 1.
Table 1
AI-based services.
Text GenAI models  | Image GenAI models | Video GenAI models | Music GenAI models
Alpaca             | Craiyon            | Ela                | Aiva
Bard (now Gemini)  | DALL-E             | GliaCloud          | Boomy
Chatsonic          | DreamStudio        | Pictory            | Soundraw
Ernie              | Fotor              | Runway             | Voicemod
Hugging Chat       | Midjourney         |                    |
Jasper             | NightCafe          |                    |
Llama              | Photosonic         |                    |
Open Assistant     |                    |                    |
Tongyi Qianwen     |                    |                    |
YouChat            |                    |                    |
Many other tools based on the aforementioned LLMs are also emerging, such as ChatPDF, which can
work with PDF documents and analyze them, or Perplexity, which serves as a knowledge hub and helps
users find answers to queries based on their needs. Similarly, the process of integrating LLMs into other
products, such as web browsers, is ongoing. In Ukraine, users gained access to artificial intelligence
in 2023, and now the range of services based on generative artificial intelligence has significantly
expanded.
2. Literature review
The increasingly deep and active integration of AI into human activity could not
fail to affect one of the most important components of human development, namely education.
Worldwide, the initial concerns about the use of AI in education were linked to fears that students
would use its capabilities primarily to cheat on academic tasks, thereby undermining the value of
educational assessment, certification, and qualifications [27]. A clear consequence of such concerns
was the prohibition by some educational institutions on students’ use of AI [28]. However, in other
institutions, a more optimistic approach was taken towards the use of AI [29], believing that it is more
progressive not to prohibit its use, but to provide support for both teachers and students in utilizing
tools based on generative AI [30]. Today, it should be noted that artificial intelligence is increasingly
being integrated into education with the aim of enhancing student learning efficiency and improving
teaching practices.
As the analysis of recent research indicates, significant developments in the use of AI in
education, particularly in higher education and in educational and research activities, are occurring in the
following key areas [31, 32]:
• assessment (including automatic assessment and evaluation of educational progress and students’
attitude to learning, individual and group assessment, etc.);
• predicting learning status (predicting student withdrawals, at-risk groups, innovative abilities,
career decisions), productivity or satisfaction, improving the learning experience;
• assistance (providing support to students in their educational pursuits, for example, anthropomorphic
presence, which includes virtual agents and intervention through digital programs);
• tutoring (providing and supporting individual strategies and approaches to students, taking into
account their characteristics and needs);
• learning management (learning analytics, sequence of educational plans and programs, development
of instructions, and student allocation).
Educational tools based on artificial intelligence can offer personalized learning experiences, automate
routine tasks, and provide real-time feedback and assessment. As numerous studies by scientists around
the world have shown, AI has the potential to be a useful tool in teaching and learning Chemistry,
particularly for creating interactive simulations, answering questions, and providing feedback on
student work [33]. It can be used to create personalized learning experiences for students [34, 35].
The results of studying the attitudes of natural science teachers toward the use of artificial intelligence
in teaching show that teachers are generally positive about integrating AI into the educational process.
Key factors influencing their willingness to use AI include self-confidence, expected benefits, ease of
use and general attitude towards AI technologies. Al Darayseh [36] found that the more easily
teachers can integrate AI into teaching Natural Sciences, the more they believe in its benefits and the
more willing they are to use it.
AI-based tools can be useful for teaching Chemistry by offering interactive simulations, answering
questions, and providing feedback on student work. There are a number of studies regarding the use of
AI in chemistry education, but it should be noted that the overwhelming majority of them focus on
ChatGPT. For instance, dos Santos [37] explored the potential of generative chatbots based on artificial
intelligence, including GPT-4 and BingChat, in chemical education. The study demonstrated that
ChatGPT and BingChat act as “thinking agents” fostering critical thinking, problem solving, concept
understanding, creativity, and personalized learning.
Another group of researchers explored the potential of artificial intelligence to enhance Chemistry
teaching in high schools in Vietnam, using OpenAI’s ChatGPT text generation model as an example.
Their study showed that ChatGPT performed
well on intermediate-level Chemistry exams, but struggled with questions that required a high level of
knowledge application (e.g., analyzing and solving complex problems). Overall, ChatGPT
performed worse than most Vietnamese students, indicating limitations in the
application of this tool. At the same time, a number of advantages of using ChatGPT were highlighted,
including increasing student engagement through interactive learning; providing immediate answers
and explanations to student questions; personalizing learning by adapting responses to individual
student needs; fostering critical thinking through open-ended questions and alternative points of
view; providing access to additional learning materials such as links and examples; and facilitating
self-directed learning [38].
In a similar study, Williams and Fadda [39] analyzed the responses of ChatGPT and ChatGPT Plus to
questions related to carbohydrate chemistry, a topic frequently included in many chemistry curricula.
The authors demonstrated that ChatGPT Plus performed significantly worse on test questions. Overall,
both language models performed better on simple, common questions for which ample information is
available.
Xuan-Quy et al. [40, 41, 42], Xuan-Quy and Le [43, 44] obtained interesting results in a series of their
studies, identifying the potential of ChatGPT for different science subjects at the high school graduate
level. It was found that the AI performed better on social science questions, such as Literature, History,
Geography, and Civic Education, and showed slightly worse results in answering questions on natural
science: Mathematics, Physics, Chemistry, and Biology.
Humphry and Fuller [45] discuss the potential use of ChatGPT in chemistry laboratories at the
bachelor’s education level. The authors propose using ChatGPT as a teaching tool for students to
demonstrate their understanding of specific topics by detecting and correcting the chatbot’s errors. We
agree with the authors’ observation that ChatGPT is prone to conceptual errors in many of its chemistry-
related responses and explanations and should not be used for general chemistry education. Overall, it is
emphasized that ChatGPT can be useful for explaining basic concepts, but requires further development
to effectively teach more complex topics.
Williams and Fadda [39] investigated the potential of integrating ChatGPT into a “flipped class”
model for teaching chemistry. They demonstrated that this approach allowed students to take a more
active role in their learning, while the teacher assumed a supervising or guiding role. Additionally, the
authors tested ChatGPT’s ability to write annotations, abstracts, and essays on chemistry topics,
tasks that are common and significant in the scientific community. However, the results indicated that
the bot often produced vague and repetitive text, frequently including inaccurate or fabricated information
and invented references.
Given the importance of hands-on experience in chemistry education, researchers have turned their
attention to integrating large language models (LLMs) into chemistry experimentation and laboratory
training, which opens up new possibilities for enhancing the efficiency and safety of the learning
process [46]. LLMs can be applied throughout the entire laboratory workflow, from preparation to
result analysis, enabling a more holistic and personalized learning experience [47].
During the preparatory phase of laboratory work, LLMs can provide students with personalized
assistance by clarifying procedures described in lab manuals and explaining the specifics of equipment
setup. This allows students to better understand the objectives of the experiment and potential risks
without the need to process lengthy manuals. Furthermore, LLMs can reinforce safety measures,
ensuring students are fully aware of the importance of adhering to laboratory safety guidelines [48, 49].
During experiments, the integration of LLMs with augmented reality (AR) technologies opens up
new possibilities for personalized guidance. These systems can collect real-time data on students’
actions and provide instant feedback, suggestions, and warnings. Such interaction not only enhances
the educational process but also significantly improves laboratory safety by identifying and
correcting unsafe practices in real time [50, 51].
During the analysis of experimental results, LLMs can serve as a powerful tool for bridging practical
observations with theoretical knowledge. They can assist students in analyzing collected data, facilitate
discussions about experimental outcomes, and link obtained data to relevant findings in the scientific
literature. This enables students to better comprehend their research results and their significance
within a broader scientific context [45].
Oh and Kang [52] proposed integrating AI into a laboratory experiment involving a carbon dioxide
fountain by using AI-based technology to regulate the laboratory setup. This approach not only
enhanced students’ understanding of the underlying chemical process and laboratory experiment but
also provided them with insights into the potential of AI in scientific research. Healy and Blade [53]
developed a project based on a state-of-the-art AI model for studying organic molecules of interest due
to their pharmaceutical significance and for further research, as well as for planning their synthesis.
The authors also demonstrated the potential of this project for online education and enhancing student
engagement in the learning process. In another study, Joss and Müller [54] assigned a task to students
requiring them to create a correlation for predicting the boiling points of organic compounds. The
processing of data for over 6,000 substances was conducted by the students using an artificial neural
network. It should be noted that working with AI tools requires a distinct set of knowledge and skills
compared to traditional search engines. One of the most crucial factors for effectively working with AI
is crafting the correct prompt, i.e., formulating an appropriate input query [53, 55]. Researchers are
currently paying significant attention to this issue. For instance, Tassoti [56] studied how chemistry
students work with AI and demonstrated that most students lack skills in effective prompting and tend
to simply copy and paste questions, thereby limiting the effectiveness of their interactions with AI.
Given the diversity of prompting strategies, ranging from basic approaches like “input-output”
prompting [57] to more advanced methods such as “Chain-of-Thought” [58] and “Tree-of-Thought”
prompting [59], and considering that students outside of digital technology fields are generally unaware
of these approaches, there is a need to train students in creating effective queries and prompts to
enhance their interactions with AI. Although the issue of prompt formulation was not the focus of our
study, as it is more relevant for open-ended queries, it remains a pressing issue in the field.
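The difference between basic "input-output" prompting and "Chain-of-Thought" prompting can be made concrete with a small sketch. The question and options are taken from table 2; the prompt wording and the helper names `format_options`, `input_output_prompt` and `chain_of_thought_prompt` are assumptions made for illustration, not prompts used in the study.

```python
def format_options(options):
    """Render the answer options as 'a. text' lines."""
    return "\n".join(f"{label}. {text}" for label, text in options.items())

def input_output_prompt(question, options):
    """Basic input-output prompt: the task and nothing else."""
    return (
        f"{question}\n{format_options(options)}\n"
        "Answer with the letter of the correct option."
    )

def chain_of_thought_prompt(question, options):
    """Chain-of-Thought style prompt: ask for the reasoning before the answer."""
    return (
        f"{question}\n{format_options(options)}\n"
        "Reason step by step: identify the reaction type, apply the relevant rule, "
        "and only then state the letter of the correct option."
    )

question = "Indicate the final product of the reaction: propene + HCl"
options = {
    "a": "2-chloropropane",
    "b": "1,3-dichloropropane",
    "c": "2,2-dichloropropane",
    "d": "1,2-dichloropropane",
    "e": "1-chloropropane",
}

print(input_output_prompt(question, options))
print()
print(chain_of_thought_prompt(question, options))
```

Copying and pasting a test question, as most students in [56] did, corresponds to the first function; the second only adds one instruction, yet it asks the model to expose reasoning that a student can then check.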
In general, an analysis of existing scientific reports on the application of AI in education underscores
the paramount importance and relevance of this topic. The fact that the capabilities of AI have
only begun to be utilized by both teachers and students attests to its pressing nature, indicating that it
is currently at the forefront of attention. To properly establish the trajectory of the teacher-student-
AI interaction, it is necessary to explore the functionality of AI that can be used in education for
optimization purposes, as well as the limitations of AI, identifying its weaknesses and possible ways to
correct them.
Analyzing the capabilities of AI will also help determine which specific needs of users (teachers and
students) can be addressed using these technologies, as well as which features can best meet their needs.
Key challenges in using AI for chemical education can be identified. Firstly, this pertains to the
complexity of chemical science, which encompasses numerous nuances [60] and makes it difficult to
capture in AI algorithms. Secondly, accurate interpretation of chemical terms and concepts is crucial, as
they play a significant role in comprehending scientific material. Many chemical concepts are abstract
and challenging to comprehend, with multiple meanings, leading to potential misinterpretation by
artificial intelligence. Thirdly, chemical formulas, structures, and equations have a clear structure
that AI may not always recognize accurately, affecting its ability to generate correct answers or solve
problems. Therefore, to obtain correct answers to chemistry-related questions and issues, it is essential
to consider the context of the question, understand the meanings of chemical terms and concepts, and
apply logical thinking and analysis.
When considering the use of generative AI in the context of Ukrainian education, another important
aspect must be noted. It is crucial to remember that tools based on generative artificial intelligence
perform much better with the English language than with any other [61]. This is because AI perceives
queries as tokens, which are fundamental building blocks upon which the model learns, understands,
and processes language. This is where the most significant and crucial language differences arise.
Firstly, the English language has a smaller vocabulary size compared to the Ukrainian language. This
means that language processing algorithms can operate with fewer tokens (words), which simplifies
their operation and improves speed and accuracy. Secondly, the English language has fewer word
forms and word variations compared to the Ukrainian language. For example, many English words
have only one form, while Ukrainian words can have multiple variations depending on context, gender,
number, etc. This makes token analysis and recognition in Ukrainian more challenging. And thirdly,
since English is used as a language of communication in many fields, including science, technology and
business, there are far more sources of information available in English than in Ukrainian. Consequently,
given the smaller vocabulary of English and the more complex word formation of Ukrainian,
artificial intelligence algorithms process English tokens more effectively;
moreover, ChatGPT, like most other AI models, was trained with a tokenization tailored to the English
language.
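Two of these differences can be illustrated before any model is involved. The sketch below compares the number of word forms of one noun and the size of the text in UTF-8 bytes, the raw units that byte-level tokenizers start from. It is an illustration only: the word lists are hand-picked examples, the helper `utf8_bytes_per_char` is hypothetical, and whether the tokenizers of ChatGPT, Gemini and Copilot behave this way is an assumption, since real tokenizers merge bytes into larger tokens.

```python
# Hand-picked example: the noun "acid" and its Ukrainian equivalent.
# English has two forms; Ukrainian singular cases alone give six distinct forms.
english_forms = ["acid", "acids"]
ukrainian_forms = ["кислота", "кислоти", "кислоті", "кислоту", "кислотою", "кислото"]

def utf8_bytes_per_char(word):
    """Average number of UTF-8 bytes needed per character of the word."""
    return len(word.encode("utf-8")) / len(word)

print("English forms:", len(english_forms))
print("Ukrainian singular forms:", len(set(ukrainian_forms)))

for word in ("acid", "кислота"):
    encoded = word.encode("utf-8")
    print(word, "characters:", len(word), "bytes:", len(encoded),
          "bytes per character:", utf8_bytes_per_char(word))
```

Latin letters occupy one byte each in UTF-8 and Cyrillic letters two, so the same concept starts out as a longer byte sequence in Ukrainian, in addition to being spread over more surface forms.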
Our experience shows that Ukrainian students have recently started actively using AI in almost all
types of their academic activities, believing that AI can provide the correct answer to any question.
Often, they use AI as their primary source of information, placing complete trust in its outputs. As a
result, students often fail to critically analyze the information and responses generated by AI, neglecting
to verify them against reliable sources such as books, scientific journals, etc. Therefore, the aim of our
research was to study the potential of AI to act as a tutor and assist students in the learning process,
helping them find answers to questions of varying complexity, and to explore the potential of using
some large language models in teaching Chemistry in the Ukrainian-language educational environment.
3. Research methods
To begin with, we conducted research to identify the most popular AI-based services among pre-service
teachers at Sumy State Pedagogical University named after A. S. Makarenko, whose education programs
include the study of chemical disciplines. We also analyzed the services used by teachers in preparing
for chemistry classes. For this purpose, we developed a questionnaire using Google Forms and employed
interview methodology.
The questionnaire included the following questions:
• Do you have experience using artificial intelligence?
• Rate your level of familiarity with the types and principles of artificial intelligence and neural
networks on a scale from 1 to 5 (1 – low, 5 – high).
• What tools have you used?
• If you have used artificial intelligence, what was the purpose?
• Have you used AI-based services to solve tasks or find answers to questions in chemical
disciplines?
• Have you encountered cases where AI-based services provided incorrect answers or solved tasks
incorrectly?
• Have you experienced instances of AI hallucinations (when AI generated something that cannot
be true)? Please specify the service where you noticed hallucinations.
• When did you notice AI hallucinations more frequently?
Additionally, the method of interviewing students about their expectations from AI was used.
The next step was to assess the potential of LLMs and their applicability in addressing chemistry-
related questions and tasks. To achieve this, we identified specific chemical disciplines and types of
questions that would help us analyze the limitations of LLM usage. Additionally, we conducted a comparative
analysis of the capabilities of selected LLMs in solving different tasks related to various chemical
disciplines studied by students at Sumy State Pedagogical University named after A. S. Makarenko,
who are future chemistry teachers. We then compared the results of the LLMs to the average scores
achieved by the students during testing. In total, the test results of 36 students were analyzed.
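The comparison described here amounts to scoring each LLM against the answer key and setting that score beside the mean student score. A minimal sketch follows, assuming multiple-choice answers are stored as option letters; all values are hypothetical placeholders, NOT results from the study, and the function name `percent_correct` is invented for illustration.

```python
from statistics import mean

def percent_correct(answers, key):
    """Share of questions answered correctly, as a percentage."""
    hits = sum(given == expected for given, expected in zip(answers, key))
    return 100 * hits / len(key)

# Hypothetical placeholder data, NOT results from the study.
answer_key = ["a", "b", "c", "d"]
llm_answers = {
    "LLM A": ["a", "b", "d", "d"],
    "LLM B": ["d", "a", "c", "d"],
}
student_scores = [75.0, 50.0, 100.0]  # per-student percentages (placeholders)

student_average = mean(student_scores)
for name, answers in llm_answers.items():
    score = percent_correct(answers, answer_key)
    verdict = "above" if score > student_average else "not above"
    print(f"{name}: {score:.0f}% ({verdict} the student average of {student_average:.0f}%)")
```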
A review of studies on the topic has shown that large language models have limitations in understanding
Chemistry as a subject despite their seemingly well-reasoned generated responses, as they
lack the ability to think and reason logically or demonstrate understanding [62]. Fergus et al. [63]
showed that AI is quite good at answering questions related simply to knowledge demonstration, but
has limitations in processing questions that require interpretation of non-textual information, such as
analysis of structural formulas. Taking these facts into account, we selected multiple-choice questions
of varying complexity. The lower-order questions required only the correct description of a specific
concept or phenomenon. The tests also included a series of higher-order questions that required the
ability to analyze the provided information, consider the context, and think logically.
Disciplines such as Structure of Matter, Organic Chemistry, Environmental Chemistry and Laboratory
Chemical Practice (LCP) are among the compulsory courses for students studying Chemistry. It should be
noted that our many years of experience show that the first three of these disciplines are quite challenging
for students to study, since they encompass all modern concepts not only in the field of Chemistry, but
also Physics and Mathematics (Structure of Matter), Biology and Ecology (Environmental Chemistry).
Organic Chemistry, in particular, is actually based on a huge amount of data on the structures of
substances, their properties and relationships, operating with substance formulas. Laboratory Chemical
Practice involves the formation of students’ knowledge about laboratory glassware, equipment, reagents,
and basic operations with them.
4. Results
The results of the first stage of our study showed that, during the educational process, our students use
LLMs, and not specific neural networks intended for the study of chemistry, which is understandable
given their pedagogical specialization. As the survey revealed, the most popular tools among both
students and teachers are the freely accessible ChatGPT 3.5, Gemini (Google Bard), and Copilot (figure 1).
Figure 1: Percentage of answers to question “What tools have you used?”.
Therefore, we proceeded with further research using these specific large language models. These
services were given identical questions related to the selected disciplines: Structure of Matter, Organic
Chemistry, Environmental Chemistry and Laboratory Chemical Practice. Students provided answers to
these same questions while studying these academic disciplines during the 2023-2024 academic year
using the Moodle distance learning platform, which is used at our university. Examples of some test
questions and the answers given by the LLMs are shown in table 2.
The test questions in Organic Chemistry were formulated to assess knowledge of nomenclature,
the structure of various classes of organic compounds, methods of their synthesis and their chemical
reactivity. We selected the topic “Alkenes and Alkadienes”. Let us analyze the large language models’
responses to some of these questions. We chose several lower-order questions that simply required
demonstrating knowledge of certain facts or laws. Correct answers to such questions are well-known facts,
presented in almost all textbooks on Organic Chemistry and even covered in high school Chemistry courses,
and do not require additional logical reasoning. However, all LLMs provided incorrect answers to some
of these questions. For example, in the question: “In the industry, buta-1,3-diene is obtained by the
Lebedev method (dehydration). What substance is used for this as raw material?” all three AIs made a
mistake. Moreover, both ChatGPT and Copilot accompanied their answers with explanatory comments.
For this question, their argumentation rested on the assertion that the Lebedev process involves
the dehydration of acetylene. However, acetylene is an unsaturated hydrocarbon that does not contain
a hydroxyl group and therefore cannot be dehydrated. Other questions required not just knowledge of
a fact or law, but the ability to analyze and apply it in non-standard situations. Gemini failed to
cope with these tasks: all its answers were incorrect and far from the truth. Particular attention
is drawn to its response to the question about the reaction of propene with hydrochloric acid
(formula HCl). Here, it was necessary to demonstrate knowledge of the reactivity of alkenes and
take into account Markovnikov’s rule. The response provided by the LLM indicates that it considered
this reaction as addition, which is correct. However, the product it indicated contains 2 chlorine atoms,
which is nonsense, since the formula of hydrochloric acid contains only one chlorine atom. Therefore, it
is evident that in this case, the AI hallucinated, as it replaced the reagent and performed the addition of
a chlorine molecule instead of hydrochloric acid. ChatGPT provided an incorrect answer, while Copilot
gave the correct answer with proper reasoning. Additionally, for some questions, ChatGPT offered two
answer options, one of which was correct, but the justification provided was irrelevant to the correct
answer.
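For reference, the two reactions discussed above can be written out explicitly. These are standard textbook equations, not reproduced from the test materials: the first shows Markovnikov addition of HCl to propene, and the second the overall Lebedev process, which starts from ethanol and combines dehydration with dehydrogenation. The `\text` command assumes the amsmath package.

```latex
% Markovnikov addition: H attaches to the terminal CH2 carbon, Cl to the central carbon,
% so exactly one chlorine atom appears in the product.
\[
\mathrm{CH_3{-}CH{=}CH_2 + HCl \longrightarrow CH_3{-}CHCl{-}CH_3}
\quad \text{(2-chloropropane)}
\]
% Lebedev process: ethanol, not acetylene, is the raw material.
\[
\mathrm{2\,C_2H_5OH \longrightarrow CH_2{=}CH{-}CH{=}CH_2 + 2\,H_2O + H_2}
\]
```

The atom balance of the first equation makes the error in the 1,2-dichloropropane answer visible: the reagents supply a single chlorine atom, so no product with two can form.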
Gemini performed slightly better than ChatGPT and Copilot on test questions in the discipline
Structure of Matter. All three large language models mostly provided correct answers to lower-order
questions. However, for questions requiring data analysis, comparison, and logical inference, they either
Table 2
Examples of questions and LLMs’ responses in chemical disciplines.
№ 1
Question: Indicate the final product of the reaction: propene + HCl
a. 2-chloropropane (correct answer)
b. 1,3-dichloropropane
c. 2,2-dichloropropane
d. 1,2-dichloropropane
e. 1-chloropropane
ChatGPT response: In this reaction, potassium hydroxide (KOH) reacts with the halide (2,3-dibromobutane) under well-known conditions, such as nucleophilic substitution. Thus, the product of the reaction will be but-2-ene (a)
Gemini response: d. 1,2-dichloropropane
Copilot response: When propene reacts with HCl, a hydrogen atom attaches to a carbon atom that has more hydrogen atoms (Markovnikov’s rule), resulting in the formation of product a (2-chloropropane)
№ 2
Question: Indicate in which aggregate state the potential energy of particles significantly exceeds the kinetic energy, Ekine.
ChatGPT response: Correct answer: b. in the liquid
Gemini response: a. in the solid
Copilot response: Correct answer: b. in the liquid