Yuliia V. Kharchenko et al. CEUR Workshop Proceedings 42–59

Advantages and limitations of large language models in chemistry education: A comparative analysis of ChatGPT, Gemini and Copilot

Yuliia V. Kharchenko, Olena M. Babenko

Sumy State Pedagogical University named after A. S. Makarenko, 87 Romenska Str., Sumy, 40002, Ukraine

Abstract
This study aims to explore the potential and limitations of large language models (LLMs) such as ChatGPT, Gemini, and Copilot in the context of chemistry education. The primary objective of the study is to compare the effectiveness of LLMs in solving chemistry tasks and to identify the key challenges associated with their implementation in education. These LLMs were selected based on a survey of students, which indicated their widespread use due to their free accessibility. To evaluate the potential of LLMs in chemistry education, we employed them to solve tasks corresponding to different levels of knowledge in different subfields of chemistry. A comparative evaluation of LLMs’ performance against that of average Ukrainian students was conducted. The results indicate that while LLMs show promise mainly in tasks not demanding deep logical reasoning, they are generally inferior to students. The key challenges identified in using LLMs in chemistry education include understanding the nuances of chemistry as a complex and multifaceted science, abstract concepts used in chemistry, recognition of chemical compound formulas and chemical reaction equations, limitations in logical reasoning, language barriers, and the occurrence of AI hallucinations. Additionally, there is a need for students to develop skills in crafting effective queries and prompts to enhance the efficiency of working with LLMs. While LLMs are promising, their implementation requires addressing the identified limitations.

Keywords
artificial intelligence, LLM, ChatGPT, Gemini, Copilot, chemistry education

1.
Introduction

A characteristic feature of modern times is the continuous dynamics of change in all spheres of human life: economy, politics, science, and education. The dominant trend in the development of contemporary civilization is its transition into an information society, wherein information and communication technologies become the objects of human activity, providing all the necessary conditions for the formation and development of a new type of personality. The rapid development of the global Internet network has led to a computer revolution in the information world, where the computer serves as the primary means of telecommunication. Considering that the current stage of development of pedagogical science in the world is characterized by an intensive search for new ways to improve the quality of education, information and communication technologies have become powerful tools in this process [1].

Various forms of information and communication technologies (ICTs) have found active application in education, ranging from electronic textbooks [2], online learning technologies [3, 4, 5], mobile applications [6, 7] to augmented and virtual reality technologies [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]. The latest term in the digital revolution is artificial intelligence (AI) [24, 25], including generative AI.

Generative artificial intelligence (GenAI) is a technology that automatically generates content in response to queries. GenAI actually creates new content using existing content. Its output may encompass formats including all symbolic representations of human thought: texts, images (including drawings, photographs, videos and animations), music and software code. GenAI learns from data it collects from web pages, conversations in social networks, and other online media. It generates its content by statistically analyzing the order of words, pixels, or other elements in the data it has learned, and identifying and replicating common patterns (such as which words typically follow other words, and in what order).

3L-Person 2024: IX International Workshop on Professional Retraining and Life-Long Learning using ICT: Person-oriented Approach, co-located with the 19th International Conference on ICT in Education, Research, and Industrial Applications (ICTERI 2024), September 23, 2024, Lviv, Ukraine
Email: yuvlakhar@gmail.com (Y. V. Kharchenko); olena.babenko@sspu.edu.ua (O. M. Babenko)
URL: https://scholar.google.com.ua/citations?user=zYiU4iMAAAAJ (Y. V. Kharchenko); https://scholar.google.com.ua/citations?user=AeYIdfAAAAAJ (O. M. Babenko)
ORCID: 0000-0002-8960-2440 (Y. V. Kharchenko); 0000-0002-1416-2700 (O. M. Babenko)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

Text-based generative artificial intelligence, utilizing a type of artificial neural network known as a general-purpose transformer, is particularly popular. Models of this type are referred to as large language models (LLMs); a widely known family is the generative pre-trained transformer, or GPT.

GPTs and their ability to automatically generate text became available to the global research community in 2018. The launch of ChatGPT in 2022, which offered free access and a user-friendly interface, became a sensation and prompted other companies to develop and launch similar systems. By mid-2023, other alternatives to ChatGPT [26] had become available, most of which were free (within certain limits), as well as services for generating images, videos, and music, some of which are listed in table 1.

Table 1
AI-based services.
Text GenAI models    Image GenAI models   Video GenAI models   Music GenAI models
Alpaca               Craiyon              Ela                  Aiva
Bard (now Gemini)    DALL-E               GliaCloud            Boomy
Chatsonic            DreamStudio          Pictory              Soundraw
Ernie                Fotor                Runway               Voicemod
Hugging Chat         Midjourney
Jasper               NightCafe
Llama                Photosonic
Open Assistant
Tongyi Qianwen
YouChat

Many other tools based on the aforementioned LLMs are also emerging, such as ChatPDF, which can work with PDF documents and analyze them, or Perplexity, which serves as a knowledge hub and helps users find answers to queries based on their needs. Similarly, the process of integrating LLMs into other products, such as web browsers, is ongoing. In Ukraine, users gained access to artificial intelligence in 2023, and now the range of services based on generative artificial intelligence has significantly expanded.

2. Literature review

The increasingly deep and active integration of AI into human activity could not fail to affect one of the most important components of human development, namely education. Worldwide, the initial concerns about the use of AI in education were linked to fears that students would use its capabilities primarily to cheat on academic tasks, thereby undermining the value of educational assessment, certification, and qualifications [27]. A clear consequence of such concerns was the prohibition by some educational institutions on students’ use of AI [28]. Other institutions took a more optimistic approach towards the use of AI [29], on the view that it is more progressive not to prohibit its use but to support both teachers and students in utilizing tools based on generative AI [30].

Today, artificial intelligence is increasingly being integrated into education with the aim of enhancing student learning efficiency and improving teaching practices.
As indicated by the analysis of recent research, significant developments regarding the use of AI in the field of education, particularly in higher education and in educational and research activities, are occurring in the following key areas [31, 32]:
• assessment (including automatic assessment and evaluation of educational progress and students’ attitude to learning, individual and group assessment, etc.);
• predicting learning status (predicting student withdrawals, at-risk groups, innovative abilities, career decisions), productivity or satisfaction, improving the learning experience;
• assistance (providing support to students in their educational pursuits, for example, anthropomorphic presence, which includes virtual agents and intervention through digital programs);
• tutoring (providing and supporting individual strategies and approaches to students, taking into account their characteristics and needs);
• learning management (learning analytics, sequence of educational plans and programs, development of instructions, and student allocation).

Educational tools based on artificial intelligence can offer personalized learning experiences, automate routine tasks, and provide real-time feedback and assessment. As numerous studies around the world have shown, AI has the potential to be a useful tool in teaching and learning Chemistry, particularly for creating interactive simulations, answering questions, and providing feedback on student work [33]. It can be used to create personalized learning experiences for students [34, 35].

Studies of the attitudes of natural science teachers toward the use of artificial intelligence in teaching show that teachers are generally positive about integrating AI into the educational process. Key factors influencing their willingness to use AI include self-confidence, expected benefits, ease of use and general attitude towards AI technologies.
Al Darayseh [36] found in his study that the easier it is for teachers to integrate AI into teaching Natural Sciences, the more they believe in its benefits and the more willing they are to use it. AI-based tools can be useful for teaching Chemistry by offering interactive simulations, answering questions, and providing feedback on student work.

There are a number of studies regarding the use of AI in chemistry education, but the overwhelming majority of them focus on ChatGPT. For instance, dos Santos [37] explored the potential of generative chatbots based on artificial intelligence, including GPT-4 and BingChat, in chemical education. The study demonstrated that ChatGPT and BingChat act as “thinking agents”, fostering critical thinking, problem solving, concept understanding, creativity, and personalized learning.

Another group of researchers explored the potential of using artificial intelligence to enhance Chemistry teaching in high schools in Vietnam, using the ChatGPT text generation model from OpenAI as an example. Their study showed that ChatGPT performed well on intermediate-level Chemistry exams, but struggled with questions that required a high level of knowledge application (e.g., analyzing and solving complex problems). Overall, ChatGPT’s responses showed lower performance than that of most Vietnamese students, indicating limitations in the application of this tool. At the same time, a number of advantages of using ChatGPT were highlighted, including increasing student engagement through interactive learning; providing immediate answers and explanations to student questions; personalizing learning by adapting responses to individual student needs; fostering critical thinking through open-ended questions and alternative points of view; providing access to additional learning materials such as links and examples; and facilitating self-directed learning [38].
In a similar study, Williams and Fadda [39] analyzed the responses of ChatGPT and ChatGPT Plus to questions related to carbohydrate chemistry, a topic frequently included in many chemistry curricula. The authors demonstrated that ChatGPT Plus performed significantly worse on test questions. Overall, both language models performed better on simple, common questions for which ample information is available.

Xuan-Quy et al. [40, 41, 42] and Xuan-Quy and Le [43, 44] obtained interesting results in a series of studies identifying the potential of ChatGPT for different school subjects at the high school graduation level. It was found that the AI performed better on social science subjects, such as Literature, History, Geography, and Civic Education, and showed slightly worse results in answering questions on natural science subjects: Mathematics, Physics, Chemistry, and Biology.

Humphry and Fuller [45] discuss the potential use of ChatGPT in chemistry laboratories at the bachelor’s level. The authors propose using ChatGPT as a teaching tool with which students demonstrate their understanding of specific topics by detecting and correcting the chatbot’s errors. We agree with the authors’ observation that ChatGPT is prone to conceptual errors in many of its chemistry-related responses and explanations and should not be used for general chemistry education. Overall, it is emphasized that ChatGPT can be useful for explaining basic concepts, but requires further development to effectively teach more complex topics.

Williams and Fadda [39] investigated the potential of integrating ChatGPT into a “flipped class” model for teaching chemistry. They demonstrated that this approach allowed students to take a more active role in their learning, while the teacher assumed a supervising or guiding role.
Additionally, the authors tested ChatGPT’s ability to write annotations, abstracts, and essays on chemistry topics, tasks that are common and significant in the scientific community. However, the results indicated that the bot often produced vague and repetitive text, frequently including inaccurate or fabricated information and invented references.

Given the importance of hands-on experience in chemistry education, researchers have turned their attention to integrating large language models (LLMs) into chemistry experimentation and laboratory training, which opens up new possibilities for enhancing the efficiency and safety of the learning process [46]. LLMs can be applied throughout the entire laboratory workflow, from preparation to result analysis, enabling a more holistic and personalized learning experience [47].

During the preparatory phase of laboratory work, LLMs can provide students with personalized assistance by clarifying procedures described in lab manuals and explaining the specifics of equipment setup. This allows students to better understand the objectives of the experiment and potential risks without the need to process lengthy manuals. Furthermore, LLMs can reinforce safety measures, ensuring students are fully aware of the importance of adhering to laboratory safety guidelines [48, 49].

During experiments, the integration of LLMs with augmented reality (AR) technologies opens up new possibilities for personalized guidance. These systems can collect real-time data on students’ actions and provide instant feedback, suggestions, and warnings. Such interaction not only enhances the educational process but also significantly improves the level of safety in the laboratory by identifying and correcting unsafe practices in real time [50, 51].

During the analysis of experimental results, LLMs can serve as a powerful tool for bridging practical observations with theoretical knowledge.
They can assist students in analyzing collected data, facilitate discussions about experimental outcomes, and link obtained data to relevant findings in the scientific literature. This enables students to better comprehend their research results and their significance within a broader scientific context [45]. Oh and Kang [52] proposed integrating AI into a laboratory experiment involving a carbon dioxide fountain by using AI-based technology to regulate the laboratory setup. This approach not only enhanced students’ understanding of the underlying chemical process and laboratory experiment but also provided them with insights into the potential of AI in scientific research. Healy and Blade [53] developed a project based on a state-of-the-art AI model for studying organic molecules of interest due to their pharmaceutical significance and for further research, as well as for planning their synthesis. The authors also demonstrated the potential of this project for online education and enhancing student engagement in the learning process. In another study, Joss and Müller [54] assigned a task to students requiring them to create a correlation for predicting the boiling points of organic compounds. The processing of data for over 6,000 substances was conducted by the students using an artificial neural network. It should be noted that working with AI tools requires a distinct set of knowledge and skills compared to traditional search engines. One of the most crucial factors for effectively working with AI is crafting the correct prompt, i.e., formulating an appropriate input query [53, 55]. Researchers are currently paying significant attention to this issue. For instance, Tassoti [56] studied how chemistry students work with AI and demonstrated that most students lack skills in effective prompting and tend to simply copy and paste questions, thereby limiting the effectiveness of their interactions with AI. 
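To make the difference between pasting a question unchanged and writing a more deliberate prompt concrete, the following is a minimal sketch in Python. The function names and the exact prompt wording are hypothetical illustrations, not taken from the cited studies; the sample question is the propene item from the tests discussed in this paper.

```python
# Minimal sketch: two ways of turning a test question into a prompt string.
# Function names and prompt wording are hypothetical, for illustration only.

def input_output_prompt(question: str) -> str:
    """Basic 'input-output' style: the question is passed on unchanged."""
    return question.strip()


def chain_of_thought_prompt(question: str, options: list[str]) -> str:
    """A prompt that asks the model to reason step by step before answering."""
    # Label the answer options a., b., c., ... as in a multiple-choice test.
    lettered = "\n".join(
        f"{chr(ord('a') + i)}. {option}" for i, option in enumerate(options)
    )
    return (
        "You are assisting a chemistry student.\n"
        f"Question: {question.strip()}\n"
        f"Options:\n{lettered}\n"
        "Reason step by step, then give the final answer as a single letter."
    )


question = "Indicate the final product of the reaction: propene + HCl"
options = [
    "2-chloropropane",
    "1,3-dichloropropane",
    "2,2-dichloropropane",
    "1,2-dichloropropane",
    "1-chloropropane",
]

plain = input_output_prompt(question)
structured = chain_of_thought_prompt(question, options)
print(structured)
```

The sketch only builds strings; whether a step-by-step prompt actually improves a given model's answers on chemistry questions would have to be tested separately.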
Given the diversity of prompting strategies, ranging from basic approaches like “input-output” prompting [57] to more advanced methods such as “Chain-of-Thought” [58] and “Tree-of-Thought” prompting [59], and considering that students outside of digital technology fields are generally unaware of these approaches, there is a need to train students in creating effective queries and prompts to enhance their interactions with AI. Although the issue of prompt formulation was not the focus of our study, as it is more relevant for open-ended queries, it remains a pressing issue in the field.

In general, an analysis of existing scientific reports on the application of AI in education underscores the importance and relevance of this topic. The fact that the capabilities of AI have only begun to be utilized by both teachers and students shows how pressing the topic is. To properly establish the trajectory of the teacher-student-AI interaction, it is necessary to explore the functionality of AI that can be used in education for optimization purposes, as well as the limitations of AI, identifying its weaknesses and possible ways to correct them. Analyzing the capabilities of AI will also help determine which specific needs of users (teachers and students) can be addressed using these technologies, as well as which features can best meet their needs.

Several key challenges in using AI for chemical education can be identified. Firstly, chemical science is complex and encompasses numerous nuances [60], which makes it difficult to capture in AI algorithms. Secondly, accurate interpretation of chemical terms and concepts is crucial, as they play a significant role in comprehending scientific material.
Many chemical concepts are abstract and challenging to comprehend, with multiple meanings, leading to potential misinterpretation by artificial intelligence. Thirdly, chemical formulas, structures, and equations have a clear structure that AI may not always recognize accurately, affecting its ability to generate correct answers or solve problems. Therefore, to obtain correct answers to chemistry-related questions and issues, it is essential to consider the context of the question, understand the meanings of chemical terms and concepts, and apply logical thinking and analysis. When considering the use of generative AI in the context of Ukrainian education, another important aspect must be noted. It is crucial to remember that tools based on generative artificial intelligence perform much better with the English language than with any other [61]. This is because AI perceives queries as tokens, which are fundamental building blocks upon which the model learns, understands, and processes language. This is where the most significant and crucial language differences arise. Firstly, the English language has a smaller vocabulary size compared to the Ukrainian language. This means that language processing algorithms can operate with fewer tokens (words), which simplifies their operation and improves speed and accuracy. Secondly, the English language has fewer word forms and word variations compared to the Ukrainian language. For example, many English words have only one form, while Ukrainian words can have multiple variations depending on context, gender, number, etc. This makes token analysis and recognition in Ukrainian more challenging. And thirdly, since English is used as a language of communication in many fields, including science, technology and business, there are far more sources of information available in English than in Ukrainian. 
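The point about word forms can be illustrated with a small sketch. It is not a real LLM tokenizer: it only counts distinct surface forms of one noun and the UTF-8 byte length of each form, on the assumption that a byte-level tokenizer starts from such bytes before any merging. The word lists and function names are chosen here for illustration.

```python
# Illustrative only: this is NOT a real LLM tokenizer.
# It compares how many surface forms one noun has in each language and how
# many UTF-8 bytes those forms occupy (Cyrillic letters take 2 bytes each).

english_forms = {"water", "waters"}
# Singular case forms of the Ukrainian noun "вода" (water).
ukrainian_forms = {"вода", "води", "воді", "воду", "водою", "водо"}


def utf8_len(word: str) -> int:
    """Number of bytes the word occupies when encoded as UTF-8."""
    return len(word.encode("utf-8"))


def summarize(forms: set[str]) -> dict[str, int]:
    """Count distinct forms and report the longest form in bytes."""
    return {
        "distinct_forms": len(forms),
        "max_utf8_bytes": max(utf8_len(word) for word in forms),
    }


en_summary = summarize(english_forms)
uk_summary = summarize(ukrainian_forms)
print("English:", en_summary)
print("Ukrainian:", uk_summary)
```

Real tokenizers merge frequent byte or character sequences learned from training data, so actual token counts depend on the specific model; the sketch only shows why a richly inflected language written in Cyrillic gives such a tokenizer more distinct material to cover.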
Consequently, because of the smaller vocabulary of English and the more complex system of word forms in Ukrainian, artificial intelligence algorithms can process tokens more effectively in English; moreover, ChatGPT, like most other AI models, was trained on tokens tailored to the English language.

Our experience shows that Ukrainian students have recently started actively using AI in almost all types of their academic activities, believing that AI can provide the correct answer to any question. Often, they use AI as their primary source of information, placing complete trust in its outputs. As a result, students often fail to critically analyze the information and responses generated by AI, neglecting to verify them against reliable sources such as books, scientific journals, etc.

Therefore, the aim of our research was to study the potential of AI to act as a tutor and assist students in the learning process, helping them find answers to questions of varying complexity, and to explore the potential of using some large language models in teaching Chemistry in the Ukrainian-language educational environment.

3. Research methods

To begin with, we conducted research to identify the most popular AI-based services among pre-service teachers at Sumy State Pedagogical University named after A. S. Makarenko, whose education programs include the study of chemical disciplines. We also analyzed the services used by teachers in preparing for chemistry classes. For this purpose, we developed a questionnaire using Google Forms and employed interview methodology. The questionnaire included the following questions:
• Do you have experience using artificial intelligence?
• Rate your level of familiarity with the types and principles of artificial intelligence and neural networks on a scale from 1 to 5 (1 – low, 5 – high).
• What tools have you used?
• If you have used artificial intelligence, what was the purpose?
• Have you used AI-based services to solve tasks or find answers to questions in chemical disciplines?
• Have you encountered cases where AI-based services provided incorrect answers or solved tasks incorrectly?
• Have you experienced instances of AI hallucinations (when AI generated something that cannot be true)? Please specify the service where you noticed hallucinations.
• When did you notice AI hallucinations more frequently?

Additionally, students were interviewed about their expectations of AI.

The next step was to assess the potential of LLMs and their applicability in addressing chemistry-related questions and tasks. To achieve this, we identified specific chemical disciplines and types of questions that would help us analyze the limitations of LLM usage. Additionally, we conducted a comparative analysis of the capabilities of selected LLMs in solving different tasks related to various chemical disciplines studied by students at Sumy State Pedagogical University named after A. S. Makarenko, who are future chemistry teachers. We then compared the results of the LLMs to the average scores achieved by the students during testing. In total, the test results of 36 students were analyzed.

A review of studies on the topic has shown that large language models have limitations in understanding Chemistry as a subject despite their seemingly well-reasoned generated responses, as they lack the ability to think and reason logically or demonstrate understanding [62]. Fergus et al. [63] showed that AI is quite good at answering questions related simply to knowledge demonstration, but has limitations in processing questions that require interpretation of non-textual information, such as analysis of structural formulas. Taking these facts into account, we selected multiple-choice questions of varying complexity.
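The comparison of LLM results with average student scores on multiple-choice tests can be sketched as follows. This is a minimal illustration under stated assumptions: the answer key, the model answers, and the student answers below are hypothetical placeholders, not data from this study, and the scoring rule (one point per correct choice, reported as a percentage) is assumed rather than taken from the text.

```python
from statistics import mean

# Minimal sketch of scoring multiple-choice answers against a key.
# All answers below are HYPOTHETICAL placeholders, not data from the study.


def percent_correct(answers: list[str], key: list[str]) -> float:
    """Share of answers matching the key, as a percentage."""
    if len(answers) != len(key):
        raise ValueError("answers and key must have the same length")
    correct = sum(given == expected for given, expected in zip(answers, key))
    return 100.0 * correct / len(key)


answer_key = ["a", "c", "b", "d"]

# One answer sheet per model (hypothetical model names).
llm_answers = {
    "model_1": ["a", "c", "d", "d"],
    "model_2": ["b", "c", "b", "a"],
}

# One answer sheet per student (hypothetical).
student_answers = [
    ["a", "c", "b", "d"],
    ["a", "b", "b", "d"],
    ["a", "c", "b", "a"],
]

llm_scores = {
    name: percent_correct(answers, answer_key)
    for name, answers in llm_answers.items()
}
student_average = mean(percent_correct(s, answer_key) for s in student_answers)

for name, score in llm_scores.items():
    print(f"{name}: {score:.1f}% (student average: {student_average:.1f}%)")
```

Scoring only the selected option ignores the quality of the explanation a model gives, so a correct letter backed by irrelevant reasoning would still count as correct in this sketch.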
The lower-order questions required only the correct description of a specific concept or phenomenon. The tests also included a series of higher-order questions that required the ability to analyze the provided information, consider the context, and think logically.

Disciplines such as Structure of Matter, Organic Chemistry, Environmental Chemistry and Laboratory Chemical Practice (LCP) are among the compulsory courses for students studying Chemistry. Our many years of experience show that the first three of these disciplines are quite challenging for students, since they draw on modern concepts not only from Chemistry but also from Physics and Mathematics (Structure of Matter) and from Biology and Ecology (Environmental Chemistry). Organic Chemistry, in particular, rests on a huge body of data on the structures of substances, their properties and relationships, and relies heavily on substance formulas. Laboratory Chemical Practice develops students’ knowledge of laboratory glassware, equipment, reagents, and basic operations with them.

4. Results

The results of the first stage of our study showed that, during the educational process, our students use general-purpose LLMs rather than specialized neural networks intended for the study of chemistry, which is understandable given their pedagogical specialization. As the survey revealed, the most popular tools among both students and teachers are the freely accessible ChatGPT 3.5, Gemini (Google Bard), and Copilot (figure 1).

Figure 1: Percentage of answers to the question “What tools have you used?”.

Therefore, we proceeded with further research using these specific large language models. These services were given identical questions related to the selected disciplines: Structure of Matter, Organic Chemistry, Environmental Chemistry and Laboratory Chemical Practice.
Students provided answers to these same questions while studying these academic disciplines during the 2023-2024 academic year using the Moodle distance learning platform, which is used at our university. Examples of some test questions and the answers given by the LLMs are shown in table 2.

The test questions in Organic Chemistry were formulated to assess knowledge of nomenclature, the structure of various classes of organic compounds, methods of their synthesis and their chemical reactivity. We selected the topic “Alkenes and Alkadienes”. Let us analyze the large language models’ responses to some of these questions. We chose several lower-order questions that simply require demonstrating knowledge of certain facts or laws. Correct answers to such questions are well-known facts, presented in almost all textbooks on Organic Chemistry and even covered in high school Chemistry courses, and do not require additional logical reasoning. However, all LLMs provided incorrect answers to some of these questions. For example, in the question “In the industry, buta-1,3-diene is obtained by the Lebedev method (dehydration). What substance is used for this as raw material?” all three LLMs made a mistake. Moreover, both ChatGPT and Copilot accompanied their answers with explanatory comments. For this question, the argumentation was based on the assertion that the Lebedev process involves the dehydration of acetylene. However, acetylene is an unsaturated hydrocarbon that does not contain a hydroxyl group and therefore cannot be dehydrated.

Other questions required not just knowledge of a fact or law, but the ability to analyze and apply them in non-standard situations. Gemini failed at these tasks: all of its answers were incorrect. Particular attention is drawn to its response to the question about the reaction of propene with hydrochloric acid, HCl.
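An addition reaction conserves atoms, so a candidate product for propene + HCl can be screened by comparing atom counts on both sides. The sketch below does this for simple molecular formulas without brackets; the function names are hypothetical, and the molecular formulas (C3H6 for propene, C3H7Cl for the chloropropanes, C3H6Cl2 for the dichloropropanes) follow directly from the compound names.

```python
import re
from collections import Counter

# Minimal sketch: atom-conservation check for an addition reaction.
# Handles only simple formulas such as "C3H6" or "HCl" (no brackets, charges).


def atom_counts(formula: str) -> Counter:
    """Count atoms of each element in a simple molecular formula."""
    counts: Counter = Counter()
    # An element symbol is an uppercase letter plus an optional lowercase one,
    # followed by an optional number (a missing number means 1).
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def is_addition_product(reactants: list[str], product: str) -> bool:
    """True if the product contains exactly the atoms of all reactants."""
    total: Counter = Counter()
    for reactant in reactants:
        total += atom_counts(reactant)
    return total == atom_counts(product)


reactants = ["C3H6", "HCl"]  # propene + hydrogen chloride

monochloro_ok = is_addition_product(reactants, "C3H7Cl")  # chloropropane
dichloro_ok = is_addition_product(reactants, "C3H6Cl2")   # dichloropropane
print(monochloro_ok, dichloro_ok)
```

The check rules out every dichloropropane option, because a single HCl molecule supplies only one chlorine atom. It cannot separate 1-chloropropane from 2-chloropropane, which share the formula C3H7Cl; choosing between them requires Markovnikov’s rule.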
Answering this question required knowledge of the reactivity of alkenes and of Markovnikov’s rule. The response provided by the LLM indicates that it considered this reaction as addition, which is correct. However, the product it indicated contains 2 chlorine atoms, which is impossible, since the formula of hydrochloric acid contains only one chlorine atom. Therefore, it is evident that in this case the AI hallucinated, as it replaced the reagent and performed the addition of a chlorine molecule instead of hydrochloric acid. ChatGPT provided an incorrect answer, while Copilot gave the correct answer with proper reasoning. Additionally, for some questions, ChatGPT offered two answer options, one of which was correct, but the justification provided was irrelevant to the correct answer.

Gemini performed slightly better than ChatGPT and Copilot on test questions in the discipline Structure of Matter. All three large language models mostly provided correct answers to lower-order questions. However, for questions requiring data analysis, comparison, and logical inference, they either

Table 2
Examples of questions and LLM responses in chemical disciplines.

№ 1
Question: Indicate the final product of the reaction: propene + HCl
a. 2-chloropropane (correct answer)
b. 1,3-dichloropropane
c. 2,2-dichloropropane
d. 1,2-dichloropropane
e. 1-chloropropane
ChatGPT response: In this reaction, potassium hydroxide (KOH) reacts with the halide (2,3-dibromobutane) under well-known conditions, such as nucleophilic substitution. Thus, the product of the reaction will be but-2-ene (a)
Gemini response: d. 1,2-dichloropropane
Copilot response: When propene reacts with HCl, a hydrogen atom attaches to a carbon atom that has more hydrogen atoms (Markovnikov’s rule), resulting in the formation of product a (2-chloropropane)

№ 2
Question: Indicate in which aggregate state the potential energy of particles significantly exceeds the kinetic energy, Ekine.
ChatGPT response: Correct answer: b. in the liquid
Gemini response: a. in the solid
Copilot response: Correct answer: b. in the liquid