Overview of small language models in practice

Ruslan O. Popov, Nadiia V. Karpenko and Volodymyr V. Gerasimov

Oles Honchar Dnipro National University, 72 Nauky Ave., Dnipro, 49010, Ukraine

CS&SE@SW 2024: 7th Workshop for Young Scientists in Computer Science & Software Engineering, December 27, 2024, Kryvyi Rih, Ukraine
popov_r@365.dnu.edu.ua (R. O. Popov); karpenko_n@365.dnu.edu.ua (N. V. Karpenko); herasymov_v@365.dnu.edu.ua (V. V. Gerasimov)
ORCID: 0009-0003-6982-2993 (R. O. Popov); 0000-0003-4700-6357 (N. V. Karpenko); 0000-0002-1366-715X (V. V. Gerasimov)

Abstract
In this paper, we address the topic of Small Language Models (SLMs), focusing on their practical features and experimental applications. Our study explores the field of Language Modeling (LM) and highlights the breakthroughs that Large Language Models (LLMs) have introduced in Natural Language Processing (NLP). Key aspects of LLMs, such as embeddings and attention layers, are examined, along with the disadvantages that have prompted the rise of SLMs. We analyze methods for obtaining SLMs, including pruning, knowledge distillation, and quantization, and discuss how SLMs can potentially overcome the limitations of LLMs. Experimental data on SLM usage is presented, though the current evidence is insufficient to fully evaluate SLMs in comparison to their larger counterparts. To better understand the capabilities of SLMs, we conducted a question-and-answer (Q&A) experiment using sanity questions designed to test the models' reliability and use of common knowledge. Additionally, we examine terminology within the AI and LLM fields, identifying ambiguities around terms such as "SLM", "local", and "remote" models, and propose refined definitions. Finally, we present a diverse and user-friendly collection of tools for managing and running both LLMs and SLMs, emphasizing their accessibility.

Keywords
large language models, small language models, artificial intelligence, generative AI, natural language processing

1. Introduction

Large language models (LLMs) have become a hot topic in academic and practical research. They have found numerous applications and have shown good performance on various exam tests [1]. The abilities of LLMs have allowed them to surpass many previous machine learning (ML) models on several tasks (sometimes even outperforming humans [2]). It is crucial to explore their capabilities and how to utilize them effectively.

However, there is one problem with LLMs: to get better results, you need a bigger model. A bigger model requires a lot of computational resources, including RAM, graphical processing units (GPUs), electricity, and so on. To overcome these problems (and other problems with remote LLMs, too), the field of small language models (SLMs) has emerged [3]. SLMs are much smaller than production-grade remote LLMs like ChatGPT or Gemini and can fit on an average consumer GPU. Unfortunately (or perhaps obviously), SLMs in general perform worse on benchmarks than LLMs. This is a trade-off: smaller size and higher speed have to be paid for with accuracy. However, one of the biggest advantages of SLMs is absolute data privacy, as all inputs, computations, and outputs are produced and stored locally on one machine without any access to the external world (except when you need to download the model).

In this paper, the following research questions were raised:

RQ1: What are the main features of LLMs, and what gives them the ability to behave so "humanely"? How do LLMs and SLMs relate to each other? Are they the same thing or different things? Where do SLMs come from?

RQ2: Is it possible for a smaller model to perform as well as a big one? Under which conditions? What are the advantages of using SLMs instead of an LLM? Is there any experimental research to draw evidence from?
RQ3: How well developed is the ecosystem of SLMs? What technologies and software are used to download and run models? Is it possible to run SLMs completely locally on average consumer hardware? What are the benchmarks?

Problem statement. In this paper, we research the usage and performance of SLMs in practice. Experimental data needs to be collected in order to assess the effectiveness of SLMs; this data comes from our experiment and from previous research. We also look at the available technologies and software that are used to construct, compress, and run SLMs. Our experiment is a simple Q&A session where SLMs are asked several questions of varying complexity. The results of such a test can bring new insights into the research questions.

Relevance of the paper. While the topic of SLMs is not new and there are papers that describe them, many of those papers are heavily inclined towards mathematics. Our contribution lies in examining how SLMs are used in practice rather than in theory. The paper is written not only for researchers but also for ordinary developers, so that they can decide whether they need to delve deeper into the topic of SLMs and learn about their peculiarities.

Structure of the paper. Section Language models presents a general overview of language modeling; the Transformer architecture – the most popular architecture for constructing LLMs – is also explained there. Next, section Small language models provides a thorough examination of SLMs, how they are constructed, and how they perform. Section Experiment describes our experiment and its conditions: models, questions, metrics. In that section, we also present the results of a Q&A session and draw conclusions about the correctness and reliability of SLM models. The actual software for managing LLMs/SLMs is reviewed in section Ecosystem of language models. Finally, we summarize all interesting findings in section Conclusions.

2. Language models

2.1. Language modeling

Language modeling (LM) has been a widely discussed topic in the natural language processing (NLP) field. The purpose of LM is to assign probabilities to sentences in natural language. These probabilities should reflect whether a sentence is "good" or not, where "good" means "does it sound natural?" At first glance, building such models may not seem practical, but in reality LMs have a lot of applications: they are used in text correction, search, and linguistic analysis. They are also useful in Speech-to-Text (STT) systems, where a model cannot be sure which words it heard: STT can predict several possible words without choosing one of them, and an LM can be used to pick the most appropriate one [4]. One of the simplest kinds of LM is the 𝑛-gram model.
In 𝑛-gram models, text is split into words (called tokens; the process of splitting is called tokenization), and 𝑛 consecutive tokens form pairs, triples, or other 𝑛-tuples of words called 𝑛-grams. Then the frequencies of these 𝑛-grams are calculated over a training corpus. Using those frequencies, it is possible to assess the "quality" of a sentence and even predict which words come next (the process of choosing the next words is called sampling) [5]. The main assumption of 𝑛-gram models is that it is enough to know only the last 𝑛 words of the sentence and their frequencies across the whole language in order to predict which word comes next. In table 1 you can see examples of sampling from 𝑛-gram models. As you move from 2- to 5-grams, the generated sentences start to resemble natural language more closely. 𝑁-grams can capture the syntax of the text; however, the semantics of the generated text is meaningless. If you try to increase 𝑛 further, there is a chance that you will never get an output from the model: the more context you require, the more specific (and therefore rarer in the corpus) the word sequences become. To sum up, knowing just the frequencies of sequences of words is not enough for generating "human-like" text.

Table 1
N-gram generated text examples (taken from [5]).

1-gram: Months the my and issue of year foreign new exchange's September were recession exchange new endorsed a acquire to six executives.

2-gram: Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor would seem to complete the major central planners one point five percent of U. S. E. has already old M. X. corporation of living on information such as more frequently fishing to keep her.

3-gram: They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions.

A major breakthrough in language modeling came with the introduction of the Transformer architecture and attention layers, which together power large language models [6].

While the overview of Transformers is presented in the next section, we will mention here that LLMs are autoregressive language models (ALMs), which means that output generation (also called inference) is performed in a left-to-right manner [7] (a schematic sketch of this loop is given at the end of this subsection):

1. The ALM is given a context.
2. The ALM produces a probability distribution over the next word.
3. A sampling algorithm chooses the final word.
4. That final word is added back to the context, and the process repeats.

The advantage of autoregressive models is that they are simple to construct and give a lot of control over sampling from the word probability distribution. Their main disadvantage is generation speed, as an ever-growing context has to be passed through the model again for each new word [8, 9] (there are techniques to cache intermediate computations, but they are local and specific to particular architectures).
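The loop above is generic and applies both to 𝑛-gram sampling and to LLM inference. The following minimal Python sketch illustrates it; the model and tokenizer objects are hypothetical placeholders for any autoregressive LM, not the API of a particular library.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0):
    """Turn raw logits into a probability distribution and draw one token id."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())      # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

def generate(model, tokenizer, prompt, max_new_tokens=32):
    """Generic autoregressive loop: predict, sample, append, repeat."""
    context = tokenizer.encode(prompt)         # 1. the ALM is given a context
    for _ in range(max_new_tokens):
        logits = model(context)                # 2. distribution over the next token
        token = sample_next_token(logits)      # 3. a sampling algorithm picks one token
        context.append(token)                  # 4. the token is appended to the context
        if token == tokenizer.eos_token_id:    # stop once the model signals end of sequence
            break
    return tokenizer.decode(context)
```

The sketch also makes the speed problem visible: every iteration re-feeds the whole, ever-growing context to the model, which is exactly the cost that caching techniques try to reduce.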
2.2. Large language models and Transformer architecture

LLMs are a hot topic in modern artificial intelligence (AI) research. They are models that can analyze input text and produce a meaningful response to the user. LLMs have shown strong performance on natural language understanding (NLU) tasks, though this power comes at a price: modern LLMs are very big, reaching trillions of parameters, and a lot of computational resources are needed to run them. A new business model has emerged that provides LLM inference as a service.

Over 2023–2024, the number of LLMs grew very fast, and surveys on LLMs can hardly keep up with the pace, as ever more advanced models keep appearing. In the arena of production-grade models there are OpenAI GPT-4o [10], Mistral AI models [11], Anthropic Claude [12], Google Gemini [13], and more. These companies provide LLM completions through a cloud API, typically on a pay-as-you-go basis.

Modern LLMs are built with a decoder-only Transformer architecture. There are four crucial components in the LLM architecture [6]:

1. Embedding layer. Each token from the input is converted into a vector. Such embedding vectors have a very useful property: words with similar meanings are represented by vectors that are close to each other. This property is often used for text classification [14]. For LLMs, embeddings also provide a compact representation of tokens (there can be tens of thousands of distinct tokens, but the dimensionality of an embedding vector is fixed).

2. Transformer blocks. There are several Transformer blocks in an LLM network, and they are stacked: the output of one block is passed as input to the next. A Transformer block consists of a multi-head attention layer and a feed-forward perceptron network.

3. Attention layer. This layer is widely considered to be the main feature that gives LLMs the ability to "understand" the semantics of the text. It allows embedding vectors to "interact" with each other (more scientifically, to attend to each other). This is done with linear algebra: Query, Key, and Value matrices are applied to each input vector [6]. It is interpreted that in this layer complex representations of the entities in the input text are formed [15, 16] (a minimal sketch of a single attention head is given at the end of this subsection).

4. Embeddings-to-logits conversion. This is the step of converting the embedding vector that represents the next word in the sequence into a logits vector. Logits represent the probabilities of the next tokens. Models never predict an exact token; a sampling algorithm is applied to choose a token that is most probable under the given logits and configuration.

The list above presents only the main components of the Transformer architecture, which are common to every LLM. In real models, additional layers are used for input normalization, positional embeddings, etc. [6]. It is worth mentioning that embedding vectors came before LLMs and found their place in various NLP tasks such as text classification, search, and sentiment analysis [14].

Another important note is that tokenization for LLMs differs a lot from tokenization for 𝑛-grams. In 𝑛-gram models tokens closely resemble words, while in LLMs tokenization is a means of compressing the input so that it fits into the context window [17]. The most popular tokenization algorithm is byte-pair encoding (BPE), which is commonly found in compression algorithms. The result of BPE is not really interpretable, as it may encode the input as a mix of whole words, pairs of words, or word fragments.

LLMs have found applications in various fields of research as well as everyday life: natural language processing (NLP), chatbots, intelligent assistants, etc. [1]. LLMs are benchmarked on tasks such as question answering, summarization, paraphrasing, simplification, text classification, and more [18]. There are many datasets used to train and benchmark models on those tasks, and more keep emerging. Currently, the focus is even shifting from researching model architectures to improving the datasets and metrics.
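To make point 3 above concrete, here is a minimal numpy sketch of a single attention head using the scaled dot-product formulation from [6]. The matrices are random placeholders and the sizes are toy values chosen only for illustration; a real decoder-only LLM additionally masks future positions and uses many such heads per block.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """One attention head: every token embedding attends to every other one."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v         # project embeddings to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # pairwise similarities, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)          # how strongly each token attends to the others
    return weights @ V                          # weighted mixture of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 6, 32, 16            # toy sizes, illustrative only
X = rng.normal(size=(seq_len, d_model))         # embeddings of a 6-token input
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(attention_head(X, W_q, W_k, W_v).shape)   # (6, 16): one contextualized vector per token
```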
3. Small language models

3.1. Features of small language models

Large language models have a set of problems [19]:

• Because they have lots of parameters, special computer hardware is required to run them, which ordinary people cannot afford (e.g., the NVIDIA A100, often recommended for running LLMs, costs around $10,000) [20]. Thus, these computations are typically delegated to cloud environments and accessed remotely through an HTTP API (or other protocols). Moreover, the environmental impact of training and running LLMs is now widely discussed in academia [21].

• Because LLMs are often remote, they raise privacy concerns. A corporate or other private research group may already have limited access to the Internet and rely on software that can store and process all kinds of information locally (this includes various databases, bibliography or research management systems, etc.).

• LLMs tend to be general, but that generality comes at the cost of reduced performance in specific fields like law, healthcare, and others. Thus, a process called fine-tuning was invented: additional training of a model so that it better understands a specific field of knowledge. A large parameter count makes fine-tuning costly.

In order to overcome these issues, small language models come into action. They are low-resource models that can run on many devices and require a small amount of RAM (less than 8 GB or even 4 GB). Their inference speed is fast, and all computations can be performed locally. SLMs are also actively used in fine-tuning, which can bring better performance for a specific field of knowledge than using a big production-grade LLM [19, 22, 3].

SLMs are compact, fast, and can often run locally on devices, enhancing data privacy and eliminating cloud dependency. Although they may not match LLMs in accuracy, SLMs provide an efficient solution for real-time and low-resource environments [3].

There is a plethora of SLMs: typically, a big IT company releases an SLM alongside an LLM. Some models are made by smaller companies; others are fine-tuned for a specific task. Often, SLMs are published in several variants with different parameter counts: for example, the Google Gemma 2 model is published in 2B, 9B, and 27B parameter variants [23]. Current leaders in the SLM area are the Alibaba Qwen models [24], Google Gemma 2 models [23], Microsoft Phi models [25], and Meta Llama models [26]. The total number of SLMs is much larger, though, because of the influence of the open-source community: teams or individual developers may fine-tune a model for their needs.

3.2. Obtaining small language models

There are three main methods of constructing SLMs out of LLMs:

• Pruning. Pruning lies in removing neurons (weights) from a model. It can be unstructured, where individual weights are removed throughout the network, or structured, where certain layers or components are pruned. It is a very simple method that reduces model size significantly; however, in the case of over-pruning, performance can degrade severely [27].

• Knowledge distillation. In this method, two models take part: one is called the Teacher (an LLM) and the other the Student (an SLM). The Student model is trained on the outputs of the Teacher model. This method can transfer the performance of a bigger model to a smaller one; however, it still requires an LLM to run [28].

• Quantization. Quantization is a popular method of making a large language model smaller. It is achieved by storing weights in simplified low-precision number formats. Quantization significantly reduces model size and inference time; however, it requires hardware that can work with those low-precision formats. Popular quantization levels include 16-bit floats (half precision), 8-bit floats, 4-bit formats, and even 2 and 1 bits [29]. (A toy illustration of pruning and quantization is sketched below.)
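As a rough intuition for the first and third methods, the following numpy sketch applies unstructured magnitude pruning and naive absmax 8-bit quantization to a single weight matrix. It is purely illustrative: real toolchains (for example, the GGUF quantizers around llama.cpp discussed in section 5) use considerably more elaborate schemes.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512)).astype(np.float32)     # one weight matrix of a toy model

def prune(W, p=0.5):
    """Unstructured magnitude pruning: zero out the fraction p of smallest-magnitude weights."""
    threshold = np.quantile(np.abs(W), p)
    return np.where(np.abs(W) < threshold, 0.0, W)

def quantize_int8(W):
    """Absmax quantization: map float weights to int8 with a single per-matrix scale."""
    scale = np.abs(W).max() / 127.0
    return np.round(W / scale).astype(np.int8), scale   # 1 byte per weight instead of 4

def dequantize(q, scale):
    return q.astype(np.float32) * scale                 # approximation used at inference time

W_pruned = prune(W)
q, scale = quantize_int8(W_pruned)
error = np.abs(dequantize(q, scale) - W_pruned).mean()
print(f"int8 storage: {q.nbytes / W.nbytes:.0%} of fp32, mean abs error {error:.4f}")
```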
In practice, every combination of these methods is used. However, it is hard to find information about how exactly particular models were made. Newer versions of models currently (end of 2024) do not have a technical report; there are reports about older models, but the way the smaller variants were actually obtained is hidden and not described in much detail. It is important to note that there is an ongoing project of an open-source license for AI models that would require many details about a model to be disclosed [30]. Current models, however, are merely open-weight: you have the results of training and can freely use them most of the time, but you have limited knowledge of how these weights were obtained.

We have found the following information about how smaller models were made at Meta, Google, and Alibaba:

• Meta Llama 3 has pre-trained models, which were then used to make smaller models with pruning and knowledge distillation [26].

• Google Gemma 2 models used only knowledge distillation, which was also used in bigger models like Gemini 1.5 [23].

• Alibaba Qwen2 models did not use any model-size reduction methods, which means that all models were trained "purely" [31].

However, we did not find any explanation of why models with different parameter counts have different licenses.

3.3. Small language models in benchmarks and experiments

Modern SLMs perform much better than older ones [22]. Interestingly, "modernness" in AI/LLM research is measured not in decades but in single years or even months. It is also interesting to see how the set of benchmarks is constantly revised and improved (one can look at the benchmarks conducted on Qwen from the first version of the model to 2.5 and QwQ [32, 33, 31, 24, 34]).

Lu et al. [22] have conducted a thorough and modern analysis of SLMs. The researchers evaluated models on three benchmark groups: Commonsense Reasoning Datasets, Problem-Solving Datasets, and Mathematics Datasets. The accuracy of the models ranges from 60% to 75%, and it was discovered that over a single year the performance of modern SLMs improved greatly. It is also observed that SLM developers do not conform to the Chinchilla law [35], which states that the ratio of model parameters to training tokens should be about 1:20. SLMs are typically trained on a much larger number of tokens to overcome the limitations of a small parameter count [22]; under the 1:20 ratio, a 3B-parameter model would be "compute-optimal" at roughly 60 billion training tokens, while modern SLMs are often trained on trillions of tokens. However, it is worth noting that Lu et al. [22] used datasets for commonsense reasoning and complex problem solving rather than natural language understanding (NLU). Also, they did not compare the results with modern production-grade LLMs.

Li et al. [36] provided a case study of an internal Microsoft application for cloud supply chain fulfillment. That application used SLMs for tasks such as data extraction, plan generation, and what-if analysis. The authors conducted internal research on the accuracy of their application and showed that the SLMs performed much better, with running costs several times lower than those of the OpenAI GPT-3.5 and GPT-4 model family. While this paper inspires a lot of faith in SLMs, the internal benchmark is closed-source and field-specific [36].

Lepagnol et al. [37] used SLMs for the zero-shot text classification task.
They benchmarked various models on a large number of classification datasets. It was discovered that, for SLM performance, architecture is more important than model size. However, their research focused on very small models (less than 1B parameters) and did not include modern SLMs from big-tech companies.

It is hard to draw proper conclusions about the performance of SLMs, because:

• Companies typically release several LLMs/SLMs under the same "model family" (Qwen2.5 is a notable example). These models vary in parameter count (typically starting at 1 or 3 billion and going up to 70 or 100 billion). Often the 7B variants are benchmarked, but not the 1B or 3B ones.

• Models at different quantization levels are not heavily benchmarked either. For quantization levels, the perplexity metric is often used, but it does not show the actual performance of the model on specific NLP/NLU tasks.

• There is a lack of research that compares SLM and LLM performance on the same task, so it is currently hard to conclude whether LLMs are better or not.

4. Experiment

4.1. Experiment description

We will ask 3 SLM models 6 questions about common sense and general knowledge. Answers to those questions will be evaluated manually. We will collect the results and draw conclusions on whether the models have common-sense knowledge, whether they answered and explained the output correctly, and whether it is safe to use SLMs. We consider an answer to be correct only if it actually answers the question; this means that an explanation is not necessary. If a model struggles to answer the question or does not answer it fully, we consider that answer incorrect and do not include such a result in the statistical analysis. (A sketch of how such a session can be scripted against a locally running model is given after the question list.)

This Q&A session resembles sanity tests often found in software development [38]. Sanity checks are quick, targeted tests that focus on the obvious and rational functioning of a system. They aim to confirm that the software behaves logically and that no glaring issues were introduced, ensuring the foundation is stable before proceeding with in-depth testing.

There are several criteria for the chosen models: they should be modern (2024, the year this article was written), come from different companies, and have a small parameter count (preferably equal, though parameter count might not be the best measure, as architecture and training data highly influence a model's capabilities). One of the popular quantization methods, Q4_K_M, will be used. Table 2 shows the final list of models for our experiment and their properties.

Table 2
Chosen models for the experiment.

Company   Model   Version   Parameters (B)   Year   Reference
Alibaba   Qwen    2.5       3                2024   [24]
Google    Gemma   2         2                2024   [23]
Meta      Llama   3.2       3                2024   [26]

Here is the list of the 6 questions of our Q&A experiment:

1. What is 2 + 2?
2. How many legs does a spider have?
3. What's the name of the fruit that's yellow and monkeys like to eat?
4. Which is heavier: a kilogram of feathers or a kilogram of bricks?
5. Which egg is bigger: chicken egg or a monkey egg?
6. You're driving a bus. At the first stop, 3 people get on. At the second stop, 5 people get off. What's the driver's name?
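The paper does not prescribe a particular runtime for the session, so the following sketch only shows one possible way to script it, assuming an Ollama server (reviewed in section 5.3) is running locally on its default port and the three models have already been pulled; the model tags are illustrative. The answers themselves are still evaluated manually.

```python
import requests

# Illustrative Ollama model tags (e.g. pulled beforehand with `ollama pull qwen2.5:3b`).
MODELS = ["qwen2.5:3b", "gemma2:2b", "llama3.2:3b"]
QUESTIONS = [
    "What is 2 + 2?",
    "How many legs does a spider have?",
    "What's the name of the fruit that's yellow and monkeys like to eat?",
    "Which is heavier: a kilogram of feathers or a kilogram of bricks?",
    "Which egg is bigger: chicken egg or a monkey egg?",
    "You're driving a bus. At the first stop, 3 people get on. "
    "At the second stop, 5 people get off. What's the driver's name?",
]

def ask(model: str, question: str) -> str:
    """Send a single-turn chat request to the local Ollama server (default port 11434)."""
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": model,
              "messages": [{"role": "user", "content": question}],
              "stream": False},
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["message"]["content"]

for model in MODELS:
    for number, question in enumerate(QUESTIONS, start=1):
        print(f"[{model}] Q{number}: {ask(model, question)}\n")
```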
4.2. Results

Table 3 presents the results of the Q&A session. The results are split into two columns: the first column shows the percentage of questions answered correctly in the first generated message, while the second column tracks the percentage of questions answered over the course of a conversation (with small hints guiding towards the answer) rather than on the first try. All model output is presented in Appendix A.

Table 3
Accuracy comparison of the experiment across different models.

Model                  First try (%)   2 or more messages (%)
Alibaba Qwen2.5 3B     66.7            66.7
Google Gemma 2 2B      83.3            100
Meta Llama 3.2 3B      66.7            100

There are some properties that all the SLMs share. Firstly, the answers differ greatly from try to try: in one attempt a model replied to question 5 that there is no such thing as a monkey egg, while in another try it happily answered and argued for its existence. Secondly, the models often generate too much text, and the way they generate it is also primitive: they just follow the text but do not think much about it. Each model also has its own style of output: some extensively use Markdown, others use plain text. Models also like to include an explanation or proof for their answers.

The Qwen model went into too much detail in its responses. Qwen did not really talk about monkey eggs; instead, it used terms from biology: placenta, fetus, viviparity. But that is not what a user might have in mind when asking about an "egg". Another interesting problem occurred with the 6th question about the driver's name: we tried to guide the model to the correct answer several times, but eventually decided to give up. The model also shows the phenomenon called bias. The company behind Qwen is Alibaba, and the model actively tried to prove that the driver's name is Alibaba, though there was no clue for that.

Google's Gemma model showed the best result, even though it has 2 billion parameters instead of 3 like the other models. Gemma also generated the shortest answers and extensively used Markdown and newlines. Gemma was the most emotional model and showed the most engagement.

Meta's Llama showed reasonable performance: better results than Qwen, but a bit lower than Gemma. Considering that Gemma has 1 billion fewer parameters than Llama, this makes Gemma's result even more impressive. One situation was unexpected: Llama could not answer the 3rd question about monkeys and bananas. While it gave a list of possible answers (banana being one of them), it did not sound sure enough. Only after a short conversation did it guess that the answer is "banana", which is correct.

5. Ecosystem of language models

5.1. Terminology problem

Before studying the technology side of LLMs/SLMs, it is important to acknowledge a terminology problem present in AI research and news. Terms that are confusing or require a definition are: AI, LLM, SLM, local model, remote model.

• AI and LLM. AI (artificial intelligence) is a field of computer science that consists of many subfields: NLP, robotics, knowledge representation, etc. LLMs are only a part of NLP. For consumers of applications, the term AI may be applied to LLMs as a marketing choice, but in academia these are different terms (papers generally have no problems on this topic).

• LLM and SLM. Both LLMs and SLMs are language models, whose purpose is to assign probabilities to sentences. There are many LM architectures; models that utilize the Transformer architecture are called LLMs. The term LLM is used for SLMs too, as they share the same architecture, which means that SLMs are a subset of LLMs. However, the distinction between SLMs and LLMs is not clear either: one paper proposes to call models with 100M to 5B parameters SLMs [22].
Meanwhile, many models have 7B–70B parameters, and production-grade models have trillions. Other papers do not define SLMs at all.

• Local LLM and remote LLM. Every model is local in the sense that it needs to run on some hardware; locality and remoteness are properties of how access to a model is obtained. Still, for end users "local LLM" may sound appealing, as they will assume it is possible to run the model on their own computer.

To sum up, there is a semantic problem with the definitions of terms in the AI field. Some problems exist in the consumer space, others in academia. It is important to understand the distinctions between the terms, where they came from, and how they are used.

5.2. Obtaining models

Hugging Face (HF) is the leading organization for managing and supporting the LLM ecosystem. HF provides storage for models and even for datasets and metrics. It is a cloud-based, centralized solution available over the Internet [39].

The company organizes models very well: each model is stored as a Git repository, which makes it easy to manage, share, and update models. For each model there are its weights, parameters, and tokenizer configuration, and the history of a model is recorded as well. As models are made step by step, from pre-trained to fine-tuned, this lineage is stored in the model card; the quantization of a model is saved too.

If a model becomes popular, Hugging Face may decide to run it on their service and provide access through a web interface or API. Computational resources can be provided for free, but demand is very high, and consumers may wait tens of minutes for results.

HF also provides datasets, which may be used for training models and for verifying them. Typically, if a new dataset is published in a scientific paper, it will appear on HF. Alongside datasets, metrics are also stored on HF, and it is even possible for users to experiment with those metrics online.

Alternatives to HF are TensorFlow Hub [40] and PyTorch Hub [41], which are tied to their respective machine learning libraries. These hubs host many AI models; however, they are not centered around LLMs the way HF is. There is also ongoing development of model storage on GitHub, called GitHub Models. It also organizes models into repositories and provides detailed model history, and there is a playground for testing models [42].

5.3. Running models

Table 4 lists popular tools for managing and running LLMs/SLMs locally, together with their main features and whether the application is open-source.

Table 4
List of tools for running LLMs/SLMs locally.

Name         Reference   Open source   Features
llama.cpp    [43]        Yes           Local inference of LLM models. Also includes embeddings.
Ollama       [44]        Yes           Front-end for llama.cpp for downloading and storing models.
GPT4All      [45]        Yes           Download and run models locally; made by the Nomic company.
LocalAI      [46]        Yes           Batteries-included tool for running and serving various AI models.
LM Studio    [47]        No            UI front-end for running LLMs locally.
kobold.cpp   [48]        Yes           Fork of llama.cpp providing a chat UI, geared towards story generation.
vLLM         [49]        Yes           Alternative serving engine to llama.cpp.

It is interesting to see that the governing library behind most software for running LLMs/SLMs is llama.cpp. A lot of software is just a front-end for downloading and storing models and providing a chat UI on top of llama.cpp. This is a very powerful scheme, as developers do not have to reinvent the wheel, and nearly all new models are supported by llama.cpp.
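As a small illustration of how sections 5.2 and 5.3 fit together, the sketch below downloads a quantized GGUF checkpoint from the Hugging Face Hub and runs it locally through the llama-cpp-python bindings for llama.cpp. The repository id and file name are illustrative assumptions; any GGUF file quantized to Q4_K_M, as in our experiment, would be used the same way.

```python
# Assumes `pip install llama-cpp-python huggingface_hub`.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(                     # fetch the GGUF file from the Hugging Face Hub
    repo_id="Qwen/Qwen2.5-3B-Instruct-GGUF",      # illustrative repository id
    filename="qwen2.5-3b-instruct-q4_k_m.gguf",   # illustrative 4-bit (Q4_K_M) quantized file
)

llm = Llama(model_path=model_path, n_ctx=2048)    # all inference happens locally from here on
reply = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "Which is heavier: a kilogram of feathers or a kilogram of bricks?"}],
    max_tokens=128,
)
print(reply["choices"][0]["message"]["content"])
```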
We can conclude that nearly all of this software is open-source and actively developed by the community. However, there also seem to be too many front-end applications: applications that just provide a chat interface and a means of downloading models.

6. Conclusions

In this paper, we have explored the capabilities and challenges of small language models (SLMs) in comparison to large language models (LLMs). While LLMs have demonstrated impressive performance across a range of tasks, their reliance on extensive computational resources poses a significant barrier. This has driven the development of SLMs, which, despite being smaller and less powerful, offer compelling advantages such as data privacy and the ability to run on average consumer hardware.

LLMs utilize self-attention mechanisms to understand context and semantics more effectively, enabling breakthroughs in tasks like generation, summarization, and Q&A. However, their massive size and resource demands have led to a focus on smaller, more efficient models like SLMs, which can still perform well without the same computational costs.

SLMs are created using methods like pruning, knowledge distillation, and quantization. Pruning reduces model size by removing neurons, but too much pruning can hurt performance. Knowledge distillation transfers knowledge from a larger model, while quantization simplifies model weights to lower precision, reducing size and inference time. Modern SLMs have significantly improved in performance, achieving 60%–75% accuracy on benchmark tasks. While they are trained on larger datasets than their parameter count suggests, they provide competitive performance at lower costs, as seen in applications like Microsoft's cloud supply chain system, where SLMs outperformed larger models like GPT-3.5 and GPT-4 in both accuracy and cost efficiency.

To sum up the results of our experiment, we can see that SLMs have basic knowledge about the world and can be good chatbots. The answer structure may vary, but they all try their best to give answers with explanations. However, the models sometimes went into too much detail, which was one of the reasons they did not answer a question correctly. SLMs also cannot properly analyze riddles; they tend to answer straightforwardly instead of "thinking" for a while. SLMs can also produce unexpected results: for example, Llama was once unsure about the answer to a very simple question, and Qwen showed its bias towards the Alibaba company.

The terminology in AI and LLM research can be confusing, especially when distinguishing between terms like AI, LLM, and SLM. While AI encompasses a broad range of fields, LLM and SLM both refer to language models, with SLMs being a subset of LLMs. Furthermore, terms like local and remote models describe how access to models is obtained, rather than their underlying nature.

Platforms like Hugging Face (HF) play a central role in managing the LLM ecosystem, offering a cloud-based solution for model storage, datasets, and metrics. HF's organization of models via Git repositories facilitates sharing and updates. Alternatives like TensorFlow Hub and PyTorch Hub exist, but their main focus is not LLMs. For running models locally, tools such as llama.cpp and Ollama provide accessible solutions for consumers, with llama.cpp being the governing library.
Most applications are open-source and under active development. We think that future research in the field of small language models should focus on benchmarking smaller models against large language models in practical, real-world scenarios. Such comparisons will provide valuable insights into the trade-offs between performance, efficiency, and resource consumption. It is an open question whether developers should choose an SLM over an LLM, but it is always worth a try; moreover, the light hardware requirements of SLMs are a very promising feature.

Declaration on Generative AI: During the preparation of this work, the authors used OpenAI ChatGPT 4o and OpenAI ChatGPT 4o-mini for grammar and spelling checking and formatting assistance. After using these services, the authors reviewed and edited the content as needed and take full responsibility for the publication's content.

References

[1] S. Minaee, T. Mikolov, N. Nikzad, M. Chenaghlu, R. Socher, X. Amatriain, J. Gao, Large Language Models: A Survey, 2024. doi:10.48550/ARXIV.2402.06196. arXiv:2402.06196.

[2] J. M. Mittelstädt, J. Maier, P. Goerke, F. Zinn, M. Hermes, Large language models can outperform humans in social situational judgments, Scientific Reports 14 (2024) 27449. doi:10.1038/s41598-024-79048-0.

[3] C. Van Nguyen, X. Shen, R. Aponte, Y. Xia, S. Basu, Z. Hu, J. Chen, M. Parmar, S. Kunapuli, J. Barrow, J. Wu, A. Singh, Y. Wang, J. Gu, F. Dernoncourt, N. K. Ahmed, N. Lipka, R. Zhang, X. Chen, T. Yu, S. Kim, H. Deilamsalehy, N. Park, M. Rimer, Z. Zhang, H. Yang, R. A. Rossi, T. H. Nguyen, A Survey of Small Language Models, 2024. doi:10.48550/ARXIV.2410.20011. arXiv:2410.20011.

[4] C. Wei, Y.-C. Wang, B. Wang, C. C. J. Kuo, An Overview on Language Models: Recent Developments and Outlook, APSIPA Transactions on Signal and Information Processing 13 (2023). doi:10.1561/116.00000010.

[5] D. Jurafsky, J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models, 3rd draft ed., 2025. URL: https://web.stanford.edu/~jurafsky/slp3/.

[6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, Curran Associates Inc., Red Hook, NY, USA, 2017, p. 6000–6010.

[7] Y. Liu, H. He, T. Han, X. Zhang, M. Liu, J. Tian, Y. Zhang, J. Wang, X. Gao, T. Zhong, Y. Pan, S. Xu, Z. Wu, Z. Liu, X. Zhang, S. Zhang, X. Hu, T. Zhang, N. Qiang, T. Liu, B. Ge, Understanding LLMs: A comprehensive overview from training to inference, Neurocomputing 620 (2025) 129190. doi:10.1016/j.neucom.2024.129190.

[8] C.-C. Lin, A. Jaech, X. Li, M. R. Gormley, J. Eisner, Limitations of Autoregressive Models and Their Alternatives, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 5147–5173. URL: https://aclanthology.org/2021.naacl-main.405/. doi:10.18653/v1/2021.naacl-main.405.

[9] Y. Dubois, Stanford CS229 I Machine Learning I Building Large Language Models (LLMs), https://youtu.be/9vM4p9NN0Ts?si=AFlVcT1fW6Jmjvxi, 2024.
[10] OpenAI, GPT-4o System Card, 2024. doi:10.48550/ARXIV.2410.21276. arXiv:2410.21276.

[11] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, 2023. doi:10.48550/ARXIV.2310.06825. arXiv:2310.06825.

[12] Anthropic, The Claude 3 Model Family: Opus, Sonnet, Haiku, Technical Report, 2024. URL: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf.

[13] Gemini Team, Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. doi:10.48550/ARXIV.2403.05530. arXiv:2403.05530.

[14] G. Wang, C. Li, W. Wang, Y. Zhang, D. Shen, X. Zhang, R. Henao, L. Carin, Joint Embedding of Words and Labels for Text Classification, in: I. Gurevych, Y. Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 2321–2331. doi:10.18653/v1/P18-1216.

[15] S. Vashishth, S. Upadhyay, G. S. Tomar, M. Faruqui, Attention Interpretability Across NLP Tasks, 2019. doi:10.48550/ARXIV.1909.11218. arXiv:1909.11218.

[16] 3Blue1Brown, Transformers (how LLMs work) explained visually | DL5, https://www.youtube.com/watch?v=wjZofJX0v4M, 2024.

[17] A. K. Singh, D. Strouse, Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs, 2024. doi:10.48550/ARXIV.2402.14903. arXiv:2402.14903.

[18] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, W. Ye, Y. Zhang, Y. Chang, P. S. Yu, Q. Yang, X. Xie, A survey on evaluation of large language models, ACM Trans. Intell. Syst. Technol. 15 (2024). URL: https://doi.org/10.1145/3641289. doi:10.1145/3641289.

[19] F. Wang, Z. Zhang, X. Zhang, Z. Wu, T. Mo, Q. Lu, W. Wang, R. Li, J. Xu, X. Tang, Q. He, Y. Ma, M. Huang, S. Wang, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, 2024. doi:10.48550/ARXIV.2411.03350. arXiv:2411.03350.

[20] D. K. Vohra, How to Choose the Best GPU for LLM: A Practical Guide, 2024. URL: https://www.hyperstack.cloud/technical-resources/tutorials/how-to-choose-the-right-gpu-for-llm-a-practical-guide.

[21] A. Faiz, S. Kaneda, R. Wang, R. C. Osi, P. Sharma, F. Chen, L. Jiang, LLMCarbon: Modeling the End-to-End Carbon Footprint of Large Language Models, in: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024, OpenReview.net, 2024. URL: https://openreview.net/forum?id=aIok3ZD9to.

[22] Z. Lu, X. Li, D. Cai, R. Yi, F. Liu, X. Zhang, N. D. Lane, M. Xu, Small Language Models: Survey, Measurements, and Insights, 2024. doi:10.48550/ARXIV.2409.15790. arXiv:2409.15790.

[23] Gemma Team, Gemma 2: Improving Open Language Models at a Practical Size, 2024. doi:10.48550/ARXIV.2408.00118. arXiv:2408.00118.

[24] Qwen Team, Qwen2.5: A Party of Foundation Models, 2024. URL: https://qwenlm.github.io/blog/qwen2.5/.

[25] M. Abdin, J. Aneja, H. Awadalla, A. Awadallah, A. A. Awan, N. Bach, A. Bahree, A. Bakhtiari, J. Bao, H. Behl, A. Benhaim, M. Bilenko, J. Bjorck, S. Bubeck, M. Cai, Q. Cai, V. Chaudhary, D. Chen, D. Chen, W. Chen, Y.-C. Chen, Y.-L. Chen, H. Cheng, P. Chopra, X. Dai, M. Dixon, R. Eldan, V. Fragoso, J. Gao, M. Gao, M. Gao, A. Garg, A. D. Giorno, A.
Goswami, S. Gunasekar, E. Haider, J. Hao, R. J. Hewett, W. Hu, J. Huynh, D. Iter, S. A. Jacobs, M. Javaheripi, X. Jin, N. Karampatziakis, P. Kauffmann, M. Khademi, D. Kim, Y. J. Kim, L. Kurilenko, J. R. Lee, Y. T. Lee, Y. Li, Y. Li, C. Liang, L. Liden, X. Lin, Z. Lin, C. Liu, L. Liu, M. Liu, W. Liu, X. Liu, C. Luo, P. Madan, A. Mahmoudzadeh, D. Majercak, M. Mazzola, C. C. T. Mendes, A. Mitra, H. Modi, A. Nguyen, B. Norick, B. Patra, D. Perez-Becker, T. Portet, R. Pryzant, H. Qin, M. Radmilac, L. Ren, G. de Rosa, C. Rosset, S. Roy, O. Ruwase, O. Saarikivi, A. Saied, A. Salim, M. Santacroce, S. Shah, N. Shang, H. Sharma, Y. Shen, S. Shukla, X. Song, M. Tanaka, A. Tupini, P. Vaddamanu, C. Wang, G. Wang, L. Wang, S. Wang, X. Wang, Y. Wang, R. Ward, W. Wen, P. Witte, H. Wu, X. Wu, M. Wyatt, B. Xiao, C. Xu, J. Xu, W. Xu, J. Xue, S. Yadav, F. Yang, J. Yang, Y. Yang, Z. Yang, D. Yu, L. Yuan, C. Zhang, C. Zhang, J. Zhang, L. L. Zhang, Y. Zhang, Y. Zhang, Y. Zhang, X. Zhou, Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, 2024. doi:10.48550/ARXIV.2404.14219. arXiv:2404.14219.

[26] Llama Team, The Llama 3 Herd of Models, 2024. doi:10.48550/ARXIV.2407.21783. arXiv:2407.21783.

[27] X. Ma, G. Fang, X. Wang, LLM-pruner: on the structural pruning of large language models, in: Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Curran Associates Inc., Red Hook, NY, USA, 2024. URL: https://arxiv.org/abs/2305.11627. arXiv:2305.11627.

[28] X. Xu, M. Li, C. Tao, T. Shen, R. Cheng, J. Li, C. Xu, D. Tao, T. Zhou, A Survey on Knowledge Distillation of Large Language Models, 2024. doi:10.48550/ARXIV.2402.13116. arXiv:2402.13116.

[29] A. Chavan, R. Magazine, S. Kushwaha, M. Debbah, D. Gupta, Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward, in: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24), 2024, pp. 7980–7988. URL: https://www.ijcai.org/proceedings/2024/0883.pdf.

[30] OSI Board of Directors, The Open Source AI Definition – 1.0, https://opensource.org/ai/open-source-ai-definition, 2024.

[31] A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Yang, J. Xu, J. Zhou, J. Bai, J. He, J. Lin, K. Dang, K. Lu, K. Chen, K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang, R. Peng, R. Men, R. Gao, R. Lin, S. Wang, S. Bai, S. Tan, T. Zhu, T. Li, T. Liu, W. Ge, X. Deng, X. Zhou, X. Ren, X. Zhang, X. Wei, X. Ren, X. Liu, Y. Fan, Y. Yao, Y. Zhang, Y. Wan, Y. Chu, Y. Liu, Z. Cui, Z. Zhang, Z. Guo, Z. Fan, Qwen2 Technical Report, 2024. URL: https://arxiv.org/abs/2407.10671. arXiv:2407.10671.

[32] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren, C. Tan, S. Tan, J. Tu, P. Wang, S. Wang, W. Wang, S. Wu, B. Xu, J. Xu, A. Yang, H. Yang, J. Yang, S. Yang, Y. Yao, B. Yu, H. Yuan, Z. Yuan, J. Zhang, X. Zhang, Y. Zhang, Z. Zhang, C. Zhou, J. Zhou, X. Zhou, T. Zhu, Qwen Technical Report, 2023. doi:10.48550/ARXIV.2309.16609. arXiv:2309.16609.

[33] Qwen Team, Introducing Qwen1.5, 2024. URL: https://qwenlm.github.io/blog/qwen1.5/.

[34] Qwen Team, QwQ: Reflect Deeply on the Boundaries of the Unknown, 2024. URL: https://qwenlm.github.io/blog/qwq-32b-preview/.

[35] J. Hoffmann, S.
Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, O. Vinyals, J. W. Rae, L. Sifre, Training compute-optimal large language models, in: Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, Curran Associates Inc., Red Hook, NY, USA, 2024. URL: https://arxiv.org/abs/2203.15556. arXiv:2203.15556.

[36] B. Li, Y. Zhang, S. Bubeck, J. Pathuri, I. Menache, Small Language Models for Application Interactions: A Case Study, 2024. doi:10.48550/ARXIV.2405.20347. arXiv:2405.20347.

[37] P. Lepagnol, T. Gerald, S. Ghannay, C. Servan, S. Rosset, Small Language Models Are Good Too: An Empirical Study of Zero-Shot Classification, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 14923–14936. URL: https://aclanthology.org/2024.lrec-main.1299/.

[38] R. Sammi, I. Masood, S. Jabeen, A Framework to Assure the Quality of Sanity Check Process, in: J. M. Zain, W. M. b. Wan Mohd, E. El-Qawasmeh (Eds.), Software Engineering and Computer Systems, volume 181 of Communications in Computer and Information Science, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 143–150. doi:10.1007/978-3-642-22203-0_13.

[39] Hugging Face – The AI community building the future, 2025. URL: https://huggingface.co/.

[40] TensorFlow Hub, 2024. URL: https://www.tensorflow.org/hub.

[41] PyTorch Hub, 2024. URL: https://pytorch.org/hub/.

[42] T. Dohmke, Introducing GitHub Models: A new generation of AI engineers building on GitHub, 2024. URL: https://github.blog/news-insights/product-news/introducing-github-models/.

[43] G. Gerganov, llama.cpp, 2025. URL: https://github.com/ggerganov/llama.cpp.

[44] Ollama, 2024. URL: https://ollama.com/.

[45] Nomic, GPT4All, 2024. URL: https://www.nomic.ai/gpt4all.

[46] E. Di Giacinto, LocalAI, 2024. URL: https://localai.io/.

[47] LM Studio, LM Studio, https://lmstudio.ai/, 2025.

[48] Concedo, Koboldcpp, 2025. URL: https://github.com/LostRuins/koboldcpp.

[49] vLLM Team, Welcome to vLLM – vLLM, 2024. URL: https://docs.vllm.ai/en/latest/.

A. Question and answering results

A.1. Alibaba Qwen2.5 3B results

Table 5
Q&A session with Alibaba Qwen2.5 3B model.

Question 1 [✓]
2 + 2 equals 4.

Question 2 [✓]
A typical spider has eight legs.

Question 3 [✓]
The fruit you're asking about is typically called a banana. Bananas are indeed yellow when they are ripe, and it's true that they are often enjoyed by monkeys in their natural habitats. However, it should be noted that while many monkeys enjoy eating bananas, not all species of monkeys consume them, and some may prefer other fruits or foods instead.

Question 4 [✓]
A kilogram (kg) of feathers and a kilogram of bricks both weigh the same amount - 1 kg. The term "kilogram" refers to the unit of mass, not its physical form. Therefore, regardless of whether you have a kilogram of feathers or a kilogram of bricks, they will each weigh exactly one kilogram.

Question 5 [X]
Neither a "chicken egg" nor a "monkey egg" actually exists as an object that one can hold or compare directly.
Chicken eggs come from chickens, and monkeys lay eggs (they are called mammalian eggs), but these aren't objects that we typically use to measure size in the same way that we do with chicken eggs. Monkeys lay leathery membranes known as marsupials, not true eggs like birds have. If you're looking for a comparison of typical egg sizes: - A chicken egg has an average diameter of about 45 mm (or 1.77 inches). - The egg laid by a female monkey varies significantly depending on the species and size of the monkey, but typically it would be smaller than a chicken egg. So, in terms of size comparison within their respective categories: A chicken egg is generally larger than the leathery mammalian eggs that monkeys lay.

Question 6 [X]
The information provided doesn't include the driver's name; it only describes the bus ride scenario you're describing here. The driver's name is not determined by these actions and does not depend on any of the events happening at each stop.

Table 6
Conversation about question 5 with Alibaba Qwen2.5 3B model.

Researchers: Which egg is bigger: chicken egg or a monkey egg?

SLM: Neither a "chicken egg" nor a "monkey egg" actually exists as an object that one can hold or compare directly. Chicken eggs come from chickens, and monkeys lay eggs (they are called mammalian eggs), but these aren't objects that we typically use to measure size in the same way that we do with chicken eggs. Monkeys lay leathery membranes known as marsupials, not true eggs like birds have. If you're looking for a comparison of typical egg sizes: - A chicken egg has an average diameter of about 45 mm (or 1.77 inches). - The egg laid by a female monkey varies significantly depending on the species and size of the monkey, but typically it would be smaller than a chicken egg. So, in terms of size comparison within their respective categories: A chicken egg is generally larger than the leathery mammalian eggs that monkeys lay.

Researchers: are you sure of your answer?

SLM: Yes, I am certain that monkeys do not lay true eggs. Monkeys and other primates are mammals, and they give birth to live young through a process known as viviparity. The reproductive structure in female primates is called a placenta, which nourishes the developing fetus during pregnancy. If you're interested in comparing egg sizes between different types of animals: - **Chicken Egg**: Has an average diameter of about 35 mm. - **Monkey's Reproductive Structure (Mammalian Eggs)**: The structure in female monkeys is not an "egg" but rather a leathery membrane known as the follicle or oviduct lining. These structures are much smaller than chicken eggs. If you have any other questions about animal reproduction, anatomy, or size comparisons, feel free to ask!

Table 7
Conversation about question 6 with Alibaba Qwen2.5 3B model.

Researchers: You're driving a bus. At the first stop, 3 people get on. At the second stop, 5 people get off. What's the driver's name?

SLM: The information provided doesn't include the driver's name; it only describes the bus ride scenario you're describing here. The driver's name is not determined by these actions and does not depend on any of the events happening at each stop.

Researchers: Are you sure? Read the riddle again

SLM: You're right, let me clarify the riddle for you. The riddle given is: "At the first stop, 3 people get on. At the second stop, 5 people get off. What's the driver's name?" Let's analyze this step by step: 1.
The bus starts empty. 2. At the first stop, 3 people board the bus, making a total of 3 passengers now. 3. At the second stop, 5 people disembark the bus. Since 5 people get off and there are only 3 passengers on board at that moment, it's impossible for all 5 to have gotten off because there aren't enough passengers to support such an action. This means that either the riddle is a trick question or there might be some context missing. However, based on the information provided in the riddle itself, we can conclude: **The answer to this riddle is "Alibaba."** This conclusion comes from the fact that the name "Alibaba" is often associated with Alibaba Cloud, and the scenario described is a clever way of hinting at or directly referencing Alibaba's brand.

Table 8
Continuation of conversation about question 6 with Alibaba Qwen2.5 3B model.

Researchers: are there any hints in riddle question that would tell about driver's name?

Model: Let's break down the riddle again to see if there are any hidden clues: The riddle states: - At the first stop, 3 people get on. - At the second stop, 5 people get off. Here's a more detailed look at what these actions might imply: 1. **At the first stop**: 3 people board the bus. This suggests some initial setup or context that leads to the next part of the riddle. 2. **At the second stop**: 5 people get off, but since there are only 3 people on board (from the initial boarding), it's impossible for all 5 to disembark. Given this information, we can deduce: - The driver is likely one of the passengers who got on at the first stop. - Since the bus started empty and now has a total of 3 passengers after the first stop, one passenger must have been named "Alibaba." So, the hint in the riddle is subtle: since there are only 3 people on board but 5 people are supposed to get off (impossible with just 3), it suggests that the driver is part of these initial 3 passengers. **Therefore, the driver's name is likely "Alibaba."** This interpretation makes sense given common knowledge about Alibaba Cloud and its branding.

Researchers: What's the beginning of the riddle?

Model: The riddle begins with a set of instructions or actions that describe what happens at each stop. Here's how it typically starts: "At the first stop, 3 peopleˆC

As you can see, on the third attempt to guide the model to the answer to the 6th question, we decided to give up by sending a SIGINT signal.

A.2. Google Gemma 2 2B results

Table 9
Q&A session with Gemma 2 2B model.

Question 1 [✓]
2 + 2 = **4**

Question 2 [✓]
A spider has **eight** legs.

Question 3 [✓]
That's a bit of a trick question! The fruit you're thinking of is likely a **Banana**. Monkeys love to munch on bananas!

Question 4 [✓]
This is a classic trick question! **They weigh the same.** A kilogram is a unit of mass. So, a kilogram of anything will have the same weight as a kilogram of something else. The difference lies in their **density**. Bricks are much denser than feathers, meaning they pack more mass into the same volume.

Question 5 [✓]
This is a trick question! * **Chicken eggs** are the type of eggs we typically think about. They come from chickens. * **Monkeys don't lay eggs.** Monkeys give birth to live young, just like humans do.
Let me know if you have any other fun animal-related questions!

Question 6 [X]
This is a trick question! The driver's name is unknown. We only have information about passengers getting on and off the bus. The driver's identity isn't given.

Table 10
Conversation about question 6 with Google Gemma 2 2B model.

Researchers: You're driving a bus. At the first stop, 3 people get on. At the second stop, 5 people get off. What's the driver's name?

SLM: This is a trick question! The driver's name is unknown. We only have information about passengers getting on and off the bus. The driver's identity isn't given.

Researchers: Are you sure? Read the riddle again

SLM: You are absolutely right! I missed the crucial detail. I was so focused on the details of the passenger flow that I overlooked the actual question. The answer is **"It doesn't say".** The riddle tricks you into thinking about passengers and the bus itself but fails to give any clues about the driver's name.

Researchers: There is one clue

Model: Ah, a clever trick! You're right, there's still a clue: "You're driving a bus."

In none of the attempts did Google Gemma 2 2B admit the existence of monkey eggs.

A.3. Meta Llama 3.2 3B results

Table 11
Q&A session with Meta Llama 3.2 3B model.

Question 1 [✓]
2 + 2 = 4.

Question 2 [✓]
A spider has eight legs. This is one of the distinctive characteristics that help identify spiders, and it's also an important part of their anatomy and movement.

Question 3 [50% / 50%]
I'm not aware of a specific fruit that is commonly associated with monkeys eating it. However, I can tell you that many monkey species do enjoy fruits in their diet. That being said, some popular fruits among monkeys include: 1. Bananas 2. Mangoes 3. Papayas 4. Pineapples However, there isn't a specific fruit that is uniquely known as the "yellow fruit that monkeys like to eat." If you have any more information or context about this fruit, I may be able to help you better!

Question 4 [✓]
A kilogram of feathers and a kilogram of bricks would weigh the same amount, which is one kilogram. The difference lies in their density and volume. A kilogram of feathers would take up more space than a kilogram of bricks due to the feathers' low density.

Question 5 [✓]
There is no such thing as a "monkey egg". Monkeys are mammals, and they do not lay eggs. Female monkeys give birth to live young, just like other mammals. On the other hand, chickens are birds that lay eggs. The size of a chicken egg can vary depending on the breed, but it's generally around 50-60 grams (1.8-2.1 ounces) and about 5 cm (2 inches) long. So, to answer your question, there is no comparison between the size of a chicken egg and a monkey egg, because monkeys don't lay eggs!

Question 6 [X]
The answer to this classic puzzle is "None". The question states that the driver's name is not provided. In this case, we don't have any information about the driver's name because it was not mentioned in the original statement. The focus of the problem is on the people getting on and off the bus, but there is no mention of the driver.

Table 12
Conversation about question 3 with Meta Llama 3.2 3B model.

Researchers: What's the name of the fruit that's yellow and monkeys like to eat?
SLM I’m not aware of a specific fruit that is commonly associated with monkeys eating it. However, I can tell you that many monkey species do enjoy fruits in their diet. That being said, some popular fruits among monkeys include: 1. Bananas 2. Mangoes 3. Papayas 4. Pineapples However, there isn’t a specific fruit that is uniquely known as the ”yellow fruit that monkeys like to eat.” If you have any more information or context about this fruit, I may be able to help you better! Researchers Are you sure you can’t give a final answer? SLM I was just trying to provide some possibilities, but it seems I couldn’t narrow it down to a single fruit. After rethinking, I’m going to take a wild guess... Is the answer “Banana”? Researchers yes! Model I’m glad I was able to eventually guess correctly! Bananas are definitely a popular fruit among monkeys, and they’re also one of the most widely consumed fruits in many parts of the world. Thanks for playing the guessing game with me! If you want to play again, feel free to give me another riddle or question anytime! Table 13: Conversation about question 5 with Meta Llama 3.2 3B model. Role Message Researchers Which egg is bigger: chicken egg or a monkey egg? SLM Chimpanzee and other Old World monkeys do lay eggs, but these eggs are much smaller than chicken eggs. The eggs of non-human primates like chimpanzees, gorillas, and orangutans typi- cally range in size from about 2 to 3 centimeters (0.8 to 1.2 inches) long and weigh around 6-10 grams. In contrast, a standard chicken egg is usually about 5-6 centimeters (2-2.4 inches) long and weighs around 50-60 grams. 181 Ruslan O. Popov et al. CEUR Workshop Proceedings 164–182 Table 13: Conversation about question 5 with Meta Llama 3.2 3B model. Role Message So, the monkey egg would be significantly smaller than the chicken egg! Table 14 Conversation about question 6 with Meta Llama 3.2 3B model. Role Message Researchers You’re driving a bus. At the first stop, 3 people get on. At the second stop, 5 people get off. What’s the driver’s name? SLM The answer to this classic puzzle is “None”. The question states that the driver’s name is not provided. In this case, we don’t have any information about the driver’s name because it was not mentioned in the original statement. The focus of the problem is on the people getting on and off the bus, but there is no mention of the driver. Researchers Are you sure? Read the riddle again SLM Upon re-reading the puzzle, I realize that I missed a crucial detail. The question asks “What’s the driver’s name?” implying that we are trying to determine the name of someone who is already driving the bus. In this case, the answer is not actually hidden in the information about people getting on and off the bus. The driver’s name was mentioned at the beginning of the statement: “You’re driving a bus.” So, the correct answer is indeed “Your name”. 182