MAHASAMUT: AI-Powered Thai Tourism using Multimodal Agents

Nongnuch Ketui1,*, Parinthapat Pengpun2, Konthee Boonmeeprakob3, Pitikorn Khlaisamniang4 and Thepchai Supnithi5

1 Rajamangala University of Technology Lanna
2 Bangkok Christian International School
3 Big Data Institute
4 Artificial Intelligence Association of Thailand
5 NECTEC

Abstract

This paper presents "MAHASAMUT", an application designed to enhance the travel experience in Thailand. MAHASAMUT stands for Multimodal Assistant for Helping Adventurous Sightseers and All Manner of Unique Travels. The system leverages AI technologies to provide comprehensive support for travelers. It features an ASR (Automatic Speech Recognition) system capable of understanding various Thai dialects and a TTS (Text-to-Speech) module for seamless communication with local people. Additionally, it includes a VQA (Visual Question Answering) model that generates descriptive captions from images, making it easier for travelers to comprehend their surroundings. The VQA model is also able to perform OCR (Optical Character Recognition) to identify and translate Thai text into English, aiding travelers in navigating signs, menus, and other written materials. Furthermore, MAHASAMUT can generate images from textual descriptions, such as visualizing a dish from a menu that lacks images. By integrating these multimodal capabilities, MAHASAMUT enhances the accessibility and enjoyment of traveling in Thailand, providing an intuitive and interactive guide for tourists.

Keywords

Multi-Agent AI Systems, Multimodality, AI Tourism

1. Introduction

The tourism industry, a cornerstone of many economies, faces increasing challenges in the era of globalization and digital transformation. Nowhere is this more evident than in Thailand, where tourism contributes significantly to the GDP [1, 2].
However, traditional tourism models often lead to issues such as overcrowding at popular destinations, cultural misunderstandings, and economic disparities that can negatively impact local communities [3]. Artificial Intelligence (AI) has emerged as a powerful tool to address these challenges, offering potential solutions for personalized experiences, language barriers, and sustainable tourism practices [4, 5].

Recent advancements in natural language processing, computer vision, and multi-agent systems have opened new avenues for enhancing the tourism experience [6, 7]. However, existing AI applications in tourism often focus on narrow aspects such as recommendation systems or language translation, without fully addressing the comprehensive nature of cultural tourism [5, 8, 9]. There is a growing need for comprehensive AI solutions that can navigate the intricate landscape of cultural tourism, particularly in linguistically and culturally diverse countries like Thailand. Such solutions must not only provide practical assistance to tourists but also promote sustainable practices and support local economies [10].

The 9th Linguistic and Cognitive Approaches to Dialog Agents Workshop, November 19, 2024, Kyoto, Japan
* Corresponding Author: nongnuchketui@rmutl.ac.th (N. Ketui)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

In this paper, we present MAHASAMUT, an AI-powered multi-agent system designed to enhance cultural tourism in Thailand. Our system integrates multiple AI technologies into one comprehensive system, including:

1. A culturally-aware large language model (LLM) fine-tuned for tourism contexts, integrated with retrieval-augmented generation (RAG) over the internet
2. An automatic speech recognition (ASR) system capable of understanding Thai dialects
3. A visual question-answering (VQA) component for landmark and artifact interpretation
4. Image and speech generation components for easy two-way communication
5. A coordinated multi-agent architecture for complex query resolution

MAHASAMUT aims to provide personalized, culturally sensitive guidance to tourists while promoting off-the-beaten-path experiences and supporting local economies. By leveraging AI to bridge language and cultural gaps, our system facilitates more meaningful interactions between tourists and local communities.

The rest of this paper is organized as follows: Section 2 reviews related work in AI applications for tourism. Section 3 describes the system architecture and key components of MAHASAMUT. Section 4 details the application of our system in cultural tourism scenarios. We discuss the implications and future directions of our work in Section 5, before concluding in Section 6.

2. Related Work

AI Applications in Tourism. The travel and tourism industry has increasingly adopted AI technologies to enhance various aspects of the tourist experience. For instance, Filieri et al. (2020) explored the characteristics of AI start-ups in Europe that focus on tourism, highlighting the significance of AI in reshaping the industry through applications in big data, machine learning, and natural language processing. These technologies enable marketing automation, segmentation, and customization, significantly benefiting the tourism supply chain, particularly in the pre-trip and post-trip phases [11]. Additionally, the development of chatbots for smart tourism mobile applications has shown promise in improving tourist experiences by providing tailored information and assistance before, during, and after their visits [12].

Supporting LLM, ASR, and TTS. The integration of automatic speech recognition (ASR) and text-to-speech (TTS) systems in tourism applications has been explored to facilitate seamless communication between tourists and locals.
For instance, the development of speech-to-speech translation interfaces has proven effective in enabling tourists to interact using their native languages, thereby enhancing their travel experiences [13]. Furthermore, visual question answering (VQA) systems, which can handle multilingual and code-mixed queries, are increasingly being used to provide detailed information about tourism objects based on images captured by mobile devices [14]. These technologies collectively contribute to a more interactive and accessible travel experience, aligning with the goals of MAHASAMUT.

3. Methodology

3.1. System Overview

The MAHASAMUT system is a multi-agent AI architecture designed to enhance cultural tourism experiences in Thailand. At its core, the system leverages a series of specialized agents orchestrated by the PhuKhao LLM. It first analyzes user queries and determines which specialized components should be employed to address the request effectively, allowing for dynamic and context-aware processing of user inputs.

PhuKhao LLM, a large language model fine-tuned on a proprietary synthetic dataset, forms the backbone of the system. It has been trained on a diverse range of tasks and topics, including cultural content specific to Thai tourism, enabling it to handle a wide array of queries with cultural sensitivity and accuracy.

Figure 1: Diagram of the MAHASAMUT system, showing the PhuKhao LLM coordinating the Search, VQA, ASR, TTS, and Stable Diffusion agents.

MAHASAMUT incorporates several key components that work in concert to provide comprehensive assistance to users. These include a Search Agent utilizing LangChain [15] and DuckDuckGo (https://duckduckgo.com/) for retrieving and summarizing up-to-date information, a Visual Question Answering (VQA) Agent based on a fine-tuned version of PaliGemma [16] for image-related queries, and an Image Generation Agent that creates contextually relevant images using Stable Diffusion 3 [17].
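The coordination described in this section (a judgment step choosing among specialized agents, whose output is folded back into the LLM's answer) can be sketched as a small dispatch loop. Everything below is an illustrative toy of ours: the real system makes the routing decision with the PhuKhao LLM rather than keyword matching, and the agent functions are stubs.

```python
# Toy sketch of MAHASAMUT-style agent dispatch. The judgment step is
# simplified to a keyword heuristic; agents are stand-in stubs.
from typing import Callable, Dict

def search_agent(query: str) -> str:
    return f"[search results for: {query}]"

def vqa_agent(query: str) -> str:
    return f"[image answer for: {query}]"

def image_agent(query: str) -> str:
    return f"[generated image for: {query}]"

def llm_answer(query: str, context: str = "") -> str:
    return f"[LLM answer to '{query}' given {context or 'no extra context'}]"

AGENTS: Dict[str, Callable[[str], str]] = {
    "search": search_agent,
    "vqa": vqa_agent,
    "image": image_agent,
}

def judge(query: str) -> str:
    """Stand-in for the LLM judgment step: pick an agent by keyword."""
    q = query.lower()
    if "photo" in q or "picture" in q:
        return "vqa"
    if "draw" in q or "generate an image" in q:
        return "image"
    if "latest" in q or "open now" in q:
        return "search"
    return "llm"

def handle(query: str) -> str:
    route = judge(query)
    if route == "llm":
        return llm_answer(query)
    # Run the chosen agent, then fold its output back into the LLM answer.
    context = AGENTS[route](query)
    return llm_answer(query, context)

print(handle("What are the latest events in Chiang Mai?"))
```

The key design point mirrored here is that agents never answer the user directly; their output becomes context for the core LLM, which produces the final response.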
To facilitate seamless interaction in various languages and modalities, the system also includes specialized modules for Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and translation. These components are crucial for navigating the diverse linguistic landscape of tourism in Thailand.

The typical workflow of the MAHASAMUT system begins with user input, which is then analyzed by the Judgment Agent. Based on this analysis, the system engages the appropriate specialized agents or modules for processing. The core PhuKhao LLM then processes the query, incorporating any additional context from the specialized components. Finally, the system generates a response, which may include text, images, or speech, depending on the nature of the query and the most appropriate format for the user.

3.2. PhuKhao LLM

The PhuKhao LLM is the core language model of the MAHASAMUT system, designed to handle Thai cultural and tourism contexts. It is built upon Typhoon-1.5 [18], a model fine-tuned by SCB10x based on the LLaMA 3 8B [19] architecture. Key features of PhuKhao LLM include:

• 8 billion parameters
• Fine-tuned using QLoRA [20]
• Training data: approximately 20,000 examples covering various tourism-related domains (generated using Seed-Free Synthetic Instruct [21])
• Task types: summarization, closed question-answering, open-ended conversation, and multiple-choice questions
• Fine-tuned for one epoch to balance knowledge and general capabilities

The model is tailored to handle queries related to Thai cuisine, local customs, historical sites, and travel information. While there are no standardized benchmarks for Thai tourism-specific language models, internal evaluations show improvements over the base Typhoon model in tourism-related tasks. PhuKhao LLM's primary focus is on natural language understanding and generation within the context of Thai culture and tourism, aligning with the MAHASAMUT system's objectives.
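A QLoRA setup along these lines (a 4-bit quantized, frozen base model plus trainable low-rank adapters, trained for one epoch) might be configured as below with the Hugging Face transformers and peft libraries. The checkpoint name, adapter hyperparameters, and batch settings are our illustrative assumptions, not the paper's actual training recipe.

```python
# Hypothetical QLoRA fine-tuning configuration (illustrative only).
# Assumes the transformers, peft, and bitsandbytes packages are installed.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # QLoRA: quantize the frozen base model
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "scb10x/llama-3-typhoon-v1.5-8b-instruct",  # assumed base checkpoint
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,     # illustrative adapter settings
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)       # only adapters are trainable

args = TrainingArguments(
    output_dir="phukhao-qlora",
    num_train_epochs=1,                          # one epoch, as in the paper
    per_device_train_batch_size=4,
    learning_rate=2e-4,
)
```

Only the low-rank adapter weights are updated during training, which is what makes fine-tuning an 8B-parameter model feasible on modest hardware.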
Table 1
Performance of VQA Models

Model                              BLEU Score
(git-base)-PyThaiNLP               49.8
CLIP_laion2B-HoogBERTa             50.6
(blip2-opt-2.7b-coco)-PyThaiNLP    52.0
DinoV2-HoogBERTa                   52.1
(blip2-flan-t5-xxl)-PyThaiNLP      53.2
Idefics2                           54.5
PaliGemma                          61.5

3.3. VQA Agent

To develop a robust Visual Question Answering (VQA) agent within the MAHASAMUT application, we employed the PaliGemma model family, specifically focusing on the pretrained (PT) models. These models are particularly suited for transfer learning, making them ideal for fine-tuning on tasks such as image captioning and VQA in the Thai language. We initially tested the mixed (mix) models due to their diverse training on various tasks, which provided valuable insights into their performance across different scenarios. For fine-tuning, we selected models with a 448x448 resolution and float32 precision, balancing memory efficiency and performance to capture and interpret image nuances accurately.

The training process utilized the MSCoco [22] and IPU2024 [23] datasets. The MSCoco dataset, a benchmark for image captioning, included 118,287 training images, 5,000 validation images, and 40,670 test images, offering a diverse set of images with corresponding captions. The IPU2024 dataset, tailored to Thai cultural contexts, included images related to Thai food and travel destinations, enhancing the model's ability to generate accurate Thai captions. This domain-specific dataset was crucial for providing localized and contextually relevant support to users.

Fine-tuning the captioning models yielded exceptional results, particularly with the PaliGemma model, which achieved the highest BLEU score of 61.5 (see Table 1). This score indicates its superior capability in generating accurate Thai captions. The BLEU score is a metric for evaluating the quality of text generated by a model, with higher scores reflecting better performance.
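To make the metric concrete: BLEU combines modified n-gram precisions (unigrams through 4-grams) with a brevity penalty. The following is a simplified, single-reference, sentence-level sketch of the idea; the scores in Table 1 would come from a standard corpus-level implementation with proper smoothing.

```python
# Simplified sentence-level BLEU illustration (uniform weights, single
# reference). Real evaluations use corpus-level BLEU with smoothing,
# e.g. via sacrebleu or NLTK.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(cand_ngrams.values()), 1)
        # Tiny floor avoids log(0) in this illustration (real BLEU smooths).
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    # Brevity penalty: punish candidates shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return brevity * math.exp(sum(log_precisions) / max_n)

print(round(bleu("a bowl of pad thai with shrimp",
                 "a bowl of pad thai with shrimp"), 3))  # identical -> 1.0
```

A perfect match scores 1.0 (reported as 100 on the percentage scale used in Table 1), and scores fall as n-gram overlap with the reference decreases.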
The models using PyThaiNLP leverage English captioning translated into Thai, while models like CLIP-HoogBERTa and DinoV2-HoogBERTa employ encoder-decoder architectures for direct Thai captioning. The superior performance of PaliGemma underscores its effectiveness and reliability in interpreting images.

Fine-tuning the models yielded exceptional results, with the system excelling in generating Thai captions, performing VQA tasks, and recognizing Thai text through Optical Character Recognition (OCR). The model's VQA capabilities allowed it to interpret and answer questions about images, making the travel experience more engaging and informative. Additionally, the OCR functionality enabled the translation of Thai text from images, such as signs and menus, into English, aiding non-Thai speaking travelers in navigating their surroundings. These advancements underscore the potential of AI in transforming the tourism industry by offering personalized and contextually relevant assistance, enhancing the travel experience in Thailand.

3.4. ASR Agent

To enable MAHASAMUT to understand spoken Thai, including various dialects, we implemented an Automatic Speech Recognition (ASR) system leveraging state-of-the-art deep learning models and techniques specifically adapted for the Thai language. We selected the Whisper [24] model family, particularly openai/whisper-large-v3, due to its robust performance across multiple languages and its ability to handle diverse accents and acoustic conditions.

Table 2
Performance of ASR Models

Model                    WER Score
wav2vec2-large-xlsr-53   0.199
whisper-medium           0.163
whisper-large-v3         0.130

The fine-tuning process involved adapting the Whisper model to Thai-specific phonetics and language patterns using transfer learning techniques. This process focused on improving accuracy for Thai pronunciation, tonal distinctions, and dialect variations.
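For reference, the WER values in Table 2 measure the word-level Levenshtein distance (substitutions, insertions, deletions) between the reference transcript and the hypothesis, normalized by the reference length. A minimal implementation:

```python
# Minimal word error rate (WER): Levenshtein distance over words,
# normalized by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / max(len(ref), 1)

print(wer("the temple opens at nine", "the temple open at nine"))  # -> 0.2
```

So a WER of 0.130 means roughly 13 word errors per 100 reference words; lower is better.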
The dataset used for fine-tuning the ASR system is from the Thai Dialects Speech Recognition Challenge [25], consisting of speech samples from four regional dialects: Central (CT), Eastern (ET), Northern (NT), and Southern (ST). This dataset is divided into training, development, and test sets, with balanced distributions across dialects. Specifically, it includes 48,000 utterances from 480 speakers in the training set, 5,200 utterances from 52 speakers in the development set, and 2,800 utterances from 28 speakers in the test set.

For the ASR agent, we utilized two models: the Thai-dialect fine-tuned Whisper for Thai ASR and the original Whisper for English ASR. Before processing the input, a classification step determines whether the input is in Thai or English, ensuring that the appropriate model is used. This dual-model approach allows MAHASAMUT to accurately interpret and respond to spoken language, enhancing communication for travelers in Thailand by providing seamless interaction across multiple dialects and languages.

The fine-tuning results demonstrated improvements in the model's ability to accurately transcribe Thai speech. As shown in Table 2, the whisper-large-v3 model achieved the best Word Error Rate (WER) of 0.130. WER is a common metric for evaluating the performance of an ASR system, representing the proportion of words that were incorrectly predicted; a lower WER indicates higher accuracy. The superior performance of the whisper-large-v3 model in recognizing Thai speech, especially considering the diverse dialects and tonal variations, underscores its effectiveness and reliability. This performance solidified its selection as the primary ASR component for MAHASAMUT.

3.5. Search Engine Agent

To enhance MAHASAMUT's information retrieval capabilities, we integrated a search engine component using the DuckDuckGo search API.
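As described in this section, the search agent pulls the text inside paragraph tags from each result page. A self-contained sketch of that extraction step, using Python's standard-library html.parser as a stand-in for the Beautiful Soup library the system actually uses:

```python
# Extract text inside <p> tags from an HTML page. This stdlib sketch
# stands in for the Beautiful Soup extraction used in the actual system.
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Only keep text that appears inside a <p> element.
        if self.in_p:
            self.paragraphs[-1] += data

def extract_paragraphs(html: str) -> list:
    parser = ParagraphExtractor()
    parser.feed(html)
    return [p.strip() for p in parser.paragraphs if p.strip()]

page = ("<html><body><h1>Wat Arun</h1>"
        "<p>Open daily 8:00-18:00.</p><p>Entry 100 baht.</p></body></html>")
print(extract_paragraphs(page))  # -> ['Open daily 8:00-18:00.', 'Entry 100 baht.']
```

In a production pipeline this would be wrapped in the timeout and error handling the section describes, since scraped pages are frequently malformed or slow to respond.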
This integration enables the system to access up-to-date information from the web, providing users with relevant and current data about their travel queries. The search engine integration process involves several steps to ensure efficient and accurate information retrieval.

The first step involves query formulation and search execution. We implemented the DuckDuckGo search API to perform web searches based on user queries. A query formulation mechanism was developed to translate user input into effective search terms, considering Thai language nuances and travel-specific contexts. The search engine is configured to retrieve the top 5 results for each query, balancing comprehensive coverage with processing efficiency. This approach ensures that users receive the most relevant and timely information for their travel needs.

Next, we focus on web scraping and content extraction. Utilizing the Beautiful Soup library (https://pypi.org/project/beautifulsoup4/), we parse the HTML content of each retrieved URL, specifically extracting text contained within <p> tags, which typically hold the main body content of web pages. We implemented error handling and timeout mechanisms to manage potential issues with webpage access or parsing, ensuring robust and reliable content extraction.

Finally, content summarization and context aggregation are performed. PhuKhao LLM is employed to summarize the extracted content from each of the top 5 search results. We developed prompts for the LLM to produce concise, relevant summaries focused on travel-related information. Each search result is processed independently, generating five separate summaries. These summaries are then concatenated to create a comprehensive context. Another PhuKhao LLM instance generates answers to user questions based on this aggregated context. This multi-step process ensures that MAHASAMUT can provide accurate and detailed answers to user queries, enhancing the travel experience with timely and relevant information.

Figure 2: ASR Model visible in the User Interface

3.6. Image Generation Agent

The Image Generation Agent within MAHASAMUT utilizes the advanced capabilities of Stable Diffusion 3 (SD3) to produce high-quality, contextually relevant images from text prompts. Stable Diffusion 3, developed by Stability AI, employs a sophisticated Multimodal Diffusion Transformer (MMDiT) architecture that uses separate sets of weights for text and image data, allowing for more accurate and coherent image generation.

To generate prompts for SD3, the Image Generation Agent relies on the PhuKhao LLM. This LLM processes user inputs, which can be in either Thai or English, and generates comprehensive prompts in English that are suitable for image generation. By incorporating few-shot examples of Thai menu dishes, the LLM is able to understand and accurately reflect cultural nuances in its prompts.
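This few-shot prompt-construction step can be sketched as follows. The example dishes, the template wording, and the stubbed llm callable are our illustrative assumptions; in the real system, PhuKhao LLM produces the English prompt and Stable Diffusion 3 renders the image.

```python
# Illustrative few-shot prompt construction for the image generation step.
# The LLM call is stubbed; in the real system PhuKhao LLM produces the
# English prompt that is then passed to Stable Diffusion 3.
FEW_SHOT = [
    ("Tom Yum Goong", "A detailed image of Tom Yum Goong, a hot and sour Thai "
                      "shrimp soup with lemongrass and chili, served in a bowl"),
    ("Som Tam", "A detailed image of Som Tam, a Thai green papaya salad with "
                "tomatoes, peanuts, and dried shrimp, on a rustic plate"),
]

def build_llm_prompt(dish: str) -> str:
    """Assemble a few-shot prompt asking the LLM for an English image prompt."""
    lines = ["Convert the dish name into an English image-generation prompt."]
    for name, prompt in FEW_SHOT:
        lines.append(f"Dish: {name}\nPrompt: {prompt}")
    lines.append(f"Dish: {dish}\nPrompt:")
    return "\n\n".join(lines)

def generate_image_prompt(dish: str, llm=None) -> str:
    prompt = build_llm_prompt(dish)
    if llm is None:
        # Stub standing in for a PhuKhao LLM call.
        llm = lambda p: "A detailed image of " + dish + ", a Thai dish"
    return llm(prompt)  # the returned text would be handed to SD3

print(generate_image_prompt("Pad Thai"))  # -> A detailed image of Pad Thai, a Thai dish
```

The few-shot examples anchor the style and level of culinary detail expected in the output prompt, which is why culturally specific dishes render more faithfully than with a bare dish name.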
For instance, when a user describes a dish like "Pad Thai", the LLM generates a prompt such as: "A detailed image of Pad Thai, a popular Thai stir-fried noodle dish, garnished with shrimp, peanuts, and lime, served on a traditional Thai plate".

Once the LLM generates a prompt, it is fed into the Stable Diffusion 3 model, which then produces a high-quality image that closely matches the description. This process enhances various aspects of the travel experience, such as visualizing dishes from menus, creating images of landmarks or cultural artifacts, and providing visual navigation aids.

4. User Interaction

4.1. Two-way Communication

As shown in Figure 2, users can easily speak into MAHASAMUT. The user's voice is interpreted by our English ASR Agent. Users can also speak Thai directly, and our Thai ASR Agent can interpret and understand it as well. Utilizing the AI For Thai API for our Translation and TTS modules, MAHASAMUT is also able to directly translate and speak the user's query, enabling two-way interaction between the users and a native speaker.

Figure 3: MAHASAMUT planning a trip using the PhuKhao LLM and the Search Agent

4.2. Trip Planning

As shown in Figure 3, MAHASAMUT is able to utilize its searching capabilities to provide users with up-to-date information, along with citations for that information. Given a query, MAHASAMUT outputs a detailed trip itinerary as shown. Furthermore, it can suggest local restaurants as well as iconic places.

4.3. Visual Capabilities

Figure 4: An example of MAHASAMUT performing OCR on Thai text

As shown in Figure 4, MAHASAMUT can directly read a menu written in Thai. Users can then ask MAHASAMUT to generate an image of a menu item if they are unsure what it is, as shown in Figure 5. MAHASAMUT can also reason over the food, providing information such as the spiciness of the dish, its composition, typical price, its history, etc.

5.
Discussion

The development and implementation of MAHASAMUT present several implications for cultural tourism in Thailand and potentially for AI applications in tourism more broadly. MAHASAMUT has the potential to greatly enhance the tourist experience in Thailand by providing personalized, culturally sensitive guidance. By leveraging AI to bridge language barriers and offer deep cultural insights, the system could facilitate more meaningful interactions between tourists and local communities. This may lead to increased cultural understanding, more diverse tourism experiences, and potentially, economic benefits for less-visited areas.

Figure 5: MAHASAMUT generating an image of a Thai dish

Future development of MAHASAMUT will focus on improvements across all components. This includes enhancing the accuracy and breadth of information provided, further optimizing system performance, and expanding language support. A key area for improvement is the user interface, which needs refinement to ensure accessibility and ease of use for a diverse range of users.

6. Conclusion

This paper presented MAHASAMUT, an AI-powered multi-agent system designed to enhance cultural tourism in Thailand. By integrating the PhuKhao LLM, visual question answering, speech recognition, and image generation capabilities, MAHASAMUT addresses key challenges in cultural tourism such as language barriers and personalized recommendations. Our system demonstrates the potential of combining multiple AI technologies to create a comprehensive tourism assistance tool.

While initial results are promising, further refinement of the user interface and extensive real-world testing are needed. MAHASAMUT represents a step towards more immersive and culturally sensitive AI applications in tourism. As development continues, we aim to balance technological innovation with ethical considerations, potentially contributing to more enriching and sustainable travel experiences.
Acknowledgments

We would like to express our gratitude to the Artificial Intelligence Association of Thailand (AIAT) for providing the essential facilities and resources that made this research possible. This work was financially supported by the National Science, Research and Innovation Fund (NSRF) through the Program Management Unit for Human Resources and Institutional Development, Research and Innovation [Grant Number: B13F670080] under the project "Development of High Caliber Manpower in Artificial Intelligence and Prompt Engineer for Industry Support, focusing on Health, Energy and Environment, Finance and Digital Industry".

References

[1] C. Theparat, Prayut: Zones vital for growth, Bangkok Post (2019). URL: https://www.bangkokpost.com/business/1753349/prayut-zones-vital-for-growth, accessed online.
[2] World Travel & Tourism Council, Travel & tourism economic impact 2023: Thailand (2023). URL: https://researchhub.wttc.org, accessed online.
[3] R. Dodds, R. Butler, The phenomena of overtourism: a review, International Journal of Tourism Cities ahead-of-print (2019). doi:10.1108/IJTC-06-2019-0090.
[4] U. Gretzel, M. Sigala, Z. Xiang, C. Koo, Smart tourism: foundations and developments, Electronic Markets 25 (2015) 179–188. URL: https://doi.org/10.1007/s12525-015-0196-8. doi:10.1007/s12525-015-0196-8.
[5] M. Li, D. Yin, H. Qiu, B. Bai, A systematic review of AI technology-based service encounters: Implications for hospitality and tourism operations, International Journal of Hospitality Management 95 (2021) 102930. URL: https://www.sciencedirect.com/science/article/pii/S0278431921000736. doi:10.1016/j.ijhm.2021.102930.
[6] D. Suhartanto, A. Brien, N. Sumarjan, N. Wibisono, Examining attraction loyalty formation in creative tourism, International Journal of Quality and Service Sciences 10 (2018) 163–175. URL: https://doi.org/10.1108/IJQSS-08-2017-0068. doi:10.1108/IJQSS-08-2017-0068.
[7] D. Buhalis, Y. Sinarta, Real-time co-creation and nowness service: lessons from tourism and hospitality, Journal of Travel & Tourism Marketing 36 (2019) 563–582. URL: https://doi.org/10.1080/10548408.2019.1592059. doi:10.1080/10548408.2019.1592059.
[8] J. Borràs, A. Moreno, A. Valls, Intelligent tourism recommender systems: A survey, Expert Systems with Applications 41 (2014) 7370–7389. URL: https://www.sciencedirect.com/science/article/pii/S0957417414003431. doi:10.1016/j.eswa.2014.06.007.
[9] Y. Yu, Application of translation technology based on AI in translation teaching, Systems and Soft Computing 6 (2024) 200072. URL: https://www.sciencedirect.com/science/article/pii/S2772941924000012. doi:10.1016/j.sasc.2024.200072.
[10] X. Font, S. McCabe, Sustainability and marketing in tourism: its contexts, paradoxes, approaches, challenges and potential, Journal of Sustainable Tourism 25 (2017) 869–883. URL: https://doi.org/10.1080/09669582.2017.1301721. doi:10.1080/09669582.2017.1301721.
[11] R. Filieri, E. D'Amico, A. Destefanis, E. Paolucci, E. Raguseo, Artificial intelligence (AI) for tourism: a European-based study on successful AI tourism start-ups, International Journal of Contemporary Hospitality Management ahead-of-print (2021). doi:10.1108/IJCHM-02-2021-0220.
[12] L. Benaddi, C. Ouaddi, A. Jakimi, B. Ouchao, Towards a software factory for developing the chatbots in smart tourism mobile applications, Procedia Computer Science 231 (2024) 275–280. URL: https://www.sciencedirect.com/science/article/pii/S1877050923022159. doi:10.1016/j.procs.2023.12.203. 14th International Conference on Emerging Ubiquitous Systems and Pervasive Networks / 13th International Conference on Current and Future Trends of Information and Communication Technologies in Healthcare (EUSPN/ICTH 2023).
[13] M. Cettolo, A. Corazza, G. Lazzari, F. Pianesi, E. Pianta, L. M. Tovena, A speech-to-speech translation based interface for tourism, in: D. Buhalis, W. Schertler (Eds.), Information and Communication Technologies in Tourism 1999, Springer Vienna, Vienna, 1999, pp. 191–200.
[14] D. Gupta, P. Lenka, A. Ekbal, P. Bhattacharyya, A unified framework for multilingual and code-mixed visual question answering, in: K.-F. Wong, K. Knight, H. Wu (Eds.), Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Association for Computational Linguistics, Suzhou, China, 2020, pp. 900–913. URL: https://aclanthology.org/2020.aacl-main.90.
[15] H. Chase, LangChain, n.d. URL: https://github.com/langchain-ai/langchain.
[16] PaliGemma development contributors, L. Beyer*, A. Steiner*, A. S. Pinto*, A. Kolesnikov*, X. Wang*, X. Zhai*, D. Salz, M. Neumann, I. Alabdulmohsin, et al., PaliGemma (2024). URL: https://www.kaggle.com/m/23393. doi:10.34740/KAGGLE/M/23393.
[17] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, K. Lacey, A. Goodwin, Y. Marek, R. Rombach, Scaling rectified flow transformers for high-resolution image synthesis (2024).
[18] K. Pipatanakul, P. Jirabovonvisut, P. Manakul, S. Sripaisarnmongkol, R. Patomwong, P. Chokchainant, K. Tharnpipitchai, Typhoon: Thai large language models, 2023. URL: https://arxiv.org/abs/2312.13951. arXiv:2312.13951.
[19] AI@Meta, Llama 3 model card (2024). URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[20] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient finetuning of quantized LLMs, 2023. URL: https://arxiv.org/abs/2305.14314. arXiv:2305.14314.
[21] P. Pengpun, C. Udomcharoenchaikit, W. Buaphet, P. Limkonchotiwat, Seed-free synthetic data generation framework for instruction-tuning LLMs: A case study in Thai, in: X. Fu, E. Fleisig (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 438–457. URL: https://aclanthology.org/2024.acl-srw.38. doi:10.18653/v1/2024.acl-srw.38.
[22] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, P. Dollár, Microsoft COCO: Common objects in context, 2015. URL: https://arxiv.org/abs/1405.0312. arXiv:1405.0312.
[23] Theerasit, AI cooking image captioning, 2024. URL: https://kaggle.com/competitions/ai-cooking-image-captioning.
[24] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust speech recognition via large-scale weak supervision, 2022. URL: https://arxiv.org/abs/2212.04356. arXiv:2212.04356.
[25] K. Thangthai, S. Thatphithakkul, V. Chunwijitra, AI-cooking ASR dialects, Kaggle (2024). URL: https://www.kaggle.com/competitions/ai-cooking-asr-dialects.