=Paper=
{{Paper
|id=Vol-3864/quasoq-2024-paper-05
|storemode=property
|title=Challenges in Adopting LLaMA: An Empirical Study of Discussions on Stack Overflow
|pdfUrl=https://ceur-ws.org/Vol-3864/quasoq-2024-paper-05.pdf
|volume=Vol-3864
|authors=Ramita Deeprom,Shiyu Yang,Yoshiki Higo,Morakot Choetkiertikul,Chaiyong Ragkhitwetsagul
|dblpUrl=https://dblp.org/rec/conf/apsec/DeepromYHCR24
}}
==Challenges in Adopting LLaMA: An Empirical Study of Discussions on Stack Overflow==
Ramita Deeprom (1,*), Shiyu Yang (2), Yoshiki Higo (2), Morakot Choetkiertikul (1) and Chaiyong Ragkhitwetsagul (1)
(1) Faculty of Information and Communication Technology, Mahidol University, 999 Phuttamonthon 4 Road, Salaya, Nakhon Pathom 73170, Thailand
(2) Graduate School of Information Science and Technology, Osaka University, 1-5 Yamadaoka, Suita, Osaka 565-0871, Japan
Abstract
LLaMA (Large Language Model Meta AI) has quickly gained traction among developers due to its wide-ranging applications and its
capabilities to be integrated into software projects. As interest in LLaMA grows, discussions around it have surged on platforms like
Stack Overflow. The developer community, with its collaborative nature, serves as a valuable source for studying LLaMA’s quality, its
emerging trends, and insights into its usage. Despite this growing attention, there has been no comprehensive study examining how the
community interacts with and discusses LLaMA. This study addresses that gap by exploring conversations on Stack Overflow related to
LLaMA and its quality, with the objective of identifying key themes and recurring patterns in these discussions. We systematically
collected and analyzed 473 posts from Stack Overflow that contained the keyword “LLaMA” or were tagged accordingly. The analysis
revealed that prominent topics of discussion include model configuration, error handling, and integration with other technologies.
Furthermore, we identified frequent co-occurring tags, underscoring LLaMA’s integration within the larger ecosystem of large language
models and its interoperability with widely used frameworks, such as Python and Hugging Face Transformers. The findings highlight
the complexity of working with LLaMA, especially in model configuration and fine-tuning, indicating a need for better resources,
documentation, and community support. The study also suggests that future development should prioritize interoperability with
popular machine-learning frameworks to improve the LLM’s quality and to strengthen LLaMA’s role in the AI ecosystem.
Keywords
LLaMA, Stack Overflow, Large Language Models’ Quality
QuASoQ 2024: 12th International Workshop on Quantitative Approaches to Software Quality, December 03, 2024, Chongqing, China
(*) Corresponding author.
Email: ramita.dep@student.mahidol.ac.th (R. Deeprom); yangsy@ist.osaka-u.ac.jp (S. Yang); higo@ist.osaka-u.ac.jp (Y. Higo); morakot.cho@mahidol.ac.th (M. Choetkiertikul); chaiyong.rag@mahidol.ac.th (C. Ragkhitwetsagul)
URL: https://ysy-dlg.github.io/MyHomePage (S. Yang); https://sites.google.com/view/yhigo/home (Y. Higo); https://morakotch.wordpress.com/ (M. Choetkiertikul); https://cragkhit.github.io/ (C. Ragkhitwetsagul)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
1. Introduction
The rapid advancements in artificial intelligence (AI) have revolutionized the field of technology, leading to the creation of powerful large language models (LLMs) that are transforming how developers and organizations approach problem-solving. One such model is Meta’s LLaMA (https://github.com/meta-llama/llama), an open-source LLM that has garnered substantial attention from the developer community [1]. Unlike many proprietary models, LLaMA offers developers the flexibility to fine-tune and customize the model for specific use cases, making it an attractive alternative for those who require more control and adaptability in their applications [1].
Recent studies have demonstrated LLaMA’s superior performance in specific domain tasks, such as cheminformatics, where it has outperformed models like ChatGPT in tasks such as SMILES embeddings for predicting molecular properties and drug-drug interactions (DDI) [2]. This suggests that LLaMA is particularly effective in tasks that demand a high degree of precision and domain-specific expertise, setting it apart from other LLMs. While models like ChatGPT, Bard, and Ernie may offer unique features such as real-time web access or higher computational efficiency, LLaMA stands out by providing a well-rounded balance across various criteria, making it suitable for a broader range of applications [3].
The growing interest in LLaMA is particularly evident on platforms like Stack Overflow (SO; https://stackoverflow.com/), an online community where developers ask questions, share knowledge, and provide solutions related to software development and technology. SO has become one of the most widely used platforms for developers to collaborate, troubleshoot, and learn from each other, making it a rich source of information about real-world challenges and practical applications of various technologies. Studying SO is essential because it reflects the collective experiences and expertise of a global community of developers, providing valuable insights into the quality, common issues, and trends that arise with new technologies like LLaMA. By examining the discussions on SO, we can better understand not only the key themes and challenges developers face with LLaMA but also the broader context of its integration and adoption in various fields. This understanding is critical for identifying areas where additional support, documentation, or tools might be needed to improve the developer experience and further promote the effective use of LLaMA.
This study aims to address an initial gap by conducting an empirical analysis of Stack Overflow posts tagged with LLaMA to identify the predominant discussion topics related to its quality and adoption, and associated technologies. By employing keyword frequency analysis and categorizing the posts, this study seeks to answer two key research questions: (1) What are the main topics of discussion regarding LLaMA on Stack Overflow? and (2) What related themes emerge in these discussions? Through this initial analysis, we aim to provide early insights into the specific challenges developers face, the solutions they seek, and the broader implications for LLaMA’s role within the AI ecosystem. The findings from this research study will serve as a foundation for a more comprehensive future study, contributing valuable insights to both practitioners and researchers as we further our understanding of LLaMA’s use and integration within diverse technical environments.
The structure of this paper is as follows. Section 2 provides the background and related work, detailing prior research on the adoption of large language models (LLMs) such as LLaMA and their application in real-world scenarios. The methodology employed in our research, including data collection and preprocessing techniques, is explained in Section 3. Section 4 presents the results of our empirical study, focusing on the analysis of Stack Overflow discussions to answer the research questions posed in this study. We then discuss the implications of our findings in Section 5, where we highlight the key challenges faced by developers when working with LLaMA and suggest potential improvements for future development. Finally, Section 6 concludes the paper and outlines potential avenues for future research, such as expanding the dataset and exploring more advanced stages of LLaMA adoption.
2. Background and Related Work
The rapid adoption of generative AI, particularly large language models (LLMs), has sparked significant interest in understanding how users are integrating these tools into their workflows. Previous research shows that many professionals increasingly rely on generative AI, such as ChatGPT and LLaMA, to solve problems traditionally addressed on platforms like Stack Overflow (SO) [4, 5]. This shift suggests a change in the problem-solving paradigm, where AI-generated solutions are becoming a first resort for many developers, streamlining the troubleshooting process and improving efficiency [4]. However, despite the growing reliance on AI, recent studies indicate that not all users are fully satisfied with AI-generated responses. Some developers still face challenges, particularly with complex technical issues, prompting them to seek human-based community support on platforms like SO [6, 5]. This highlights the limitations of AI models in delivering contextually accurate and reliable answers for more nuanced problems [7, 5].
LLaMA, an open-source LLM created by Meta, offers notable advantages that contribute to its rising popularity within the developer community. Released to the public in February 2023, with LLaMA 3.1 debuting in July 2024, the model has garnered over 300 million downloads globally, underscoring its widespread adoption [8]. Compared to ChatGPT, LLaMA is perceived as more complex to install and configure, yet its appeal lies in its ability to provide fine-tuned, context-specific outputs, making it particularly attractive to developers who require precision and control [8, 3]. Furthermore, LLaMA’s enhanced security features and the ability to be hosted internally within organizations without the risk of leaking sensitive information make it a strong contender for enterprise use cases [3]. These characteristics reduce the risk of biased outputs, which is often a concern for beginners relying too heavily on AI-generated responses [3]. The model’s open-source nature also allows for greater flexibility in integration and customization, offering experienced developers a robust tool for specialized applications [3, 2].
Studies have highlighted that LLaMA excels in certain domain-specific tasks, such as cheminformatics, where it outperforms ChatGPT in Simplified Molecular Input Line Entry System (SMILES) embeddings for molecular property and drug-drug interaction (DDI) predictions [2]. This superior performance suggests that LLaMA is well-suited for tasks that require a high degree of precision and the handling of specific domain data, further distinguishing it from other LLMs. Similarly, comparative analyses have shown that while ChatGPT and other models like Bard and Ernie offer advantages in certain areas, such as real-time internet access or computational efficiency, LLaMA provides balanced performance across multiple criteria, making it a versatile tool for various applications [3].
Moreover, the performance of Llama 2 has been noted to exhibit minimal variation across different languages, offering consistency in sentiment analysis tasks. However, this consistency sometimes comes at the cost of skewing ratings towards positive sentiment, even in scenarios where more nuanced interpretations are required [9]. Furthermore, recent studies on job recommendations generated by LLaMA reveal both strengths and limitations. While LLaMA suggests a wider variety of professions compared to ChatGPT, its recommendations often include impractical or nonsensical roles, reflecting a trade-off between diversity and practicality [10]. This indicates the need for improved prompt engineering and bias mitigation in LLM applications to ensure fairer and more relevant outcomes across diverse user groups.
Several studies have leveraged Stack Overflow data to analyze trends within the developer community, providing insights into quality, common challenges, emerging technologies, and evolving developer needs. Silva et al. [11] report that ChatGPT has significantly impacted SO, offering fast, human-like responses that have raised questions about the platform’s future in the AI era. The study noted a decline in overall SO activity, though some communities remain active. The LLMs examined excel at addressing general programming queries but struggle with specific frameworks and libraries, leading developers to return to SO when LLMs fall short. Similarly, Zhong et al. [6] developed the RobustAPI dataset, featuring 1,208 coding questions from SO related to 18 Java APIs. Their study revealed that even advanced models like GPT-4 produced API misuses in 62% of the generated code, posing risks when applied to real-world software development.
Nonetheless, no study has investigated the quality of LLaMA and its adoption in practice. This study fills that gap by examining LLaMA-related discussions on SO.
3. Methodology
As shown in Figure 1, a motivating example is a Stack Overflow post where a user inquires about installing the llama-cpp-python package. This post has garnered 38,975 views (at the time of writing), illustrating the widespread interest in LLaMA but also highlighting that developers frequently encounter challenges requiring external help. Despite its growing popularity, the installation and configuration of LLaMA packages remain common stumbling blocks.
Figure 1: A LLaMA question on Stack Overflow (Post ID 77267346)
In light of this, our research focuses on examining the discussions surrounding LLaMA on Stack Overflow. By analyzing these interactions, we aim to uncover the most prevalent issues and limitations faced by developers working with LLaMA. This study not only seeks to identify key challenges but also offers valuable insights for both novice users looking to get started with LLaMA and experienced developers seeking to optimize and enhance their implementations. Ultimately, our findings will contribute to improving the support and resources available to the LLaMA
community, facilitating smoother adoption and integration of the model into various workflows.
We ask the following research questions in this study.
1. RQ1: What are the topics of discussion about LLaMA on Stack Overflow? We aim to identify and categorize the topics of discussion related to “LLaMA” on Stack Overflow. This is to determine the most common themes and issues raised by the developer community concerning LLaMA.
2. RQ2: What are the related topics when discussing LLaMA on Stack Overflow? The second research question focuses on identifying related tags co-occurring with the LLaMA tag on Stack Overflow. This is to find other relevant topics or challenges that LLaMA users may face or need to study.
This section details the steps undertaken to address the two research questions posed earlier. As illustrated in Figure 2, our methodology involves three key phases: data collection, preprocessing, and analysis. Each phase is designed to ensure a systematic and thorough examination of Stack Overflow discussions related to LLaMA. In the data collection phase, we gathered relevant posts from Stack Overflow, ensuring a representative sample of developer interactions. This was followed by the preprocessing phase, where we cleansed and refined the data to ensure its quality and relevance for analysis. Finally, the analysis phase involved categorizing the posts and performing keyword frequency analysis to uncover common themes and patterns.
Figure 2: The Experimental Procedure
3.1. Data Collection
Our study is based on data collected directly from Stack Overflow, particularly focusing on posts related to LLaMA, the generative AI model from Meta. Initially, we considered using the Stack Overflow public data dump files, including Posts.xml and Tags.xml (https://archive.org/details/stackexchange). However, after downloading and inspecting these files, we found that they did not contain recent posts relevant to our study, particularly those involving technologies like LLaMA, likely because LLaMA was released after the last update of the data dump.
As a result, we adopted a more direct and up-to-date data collection approach. We used a web scraping tool (Web Scraper version 1.87.6, available at https://webscraper.io/) to scrape posts directly from Stack Overflow. The scraping process was conducted on July 22, 2024. To comply with Stack Overflow’s usage policies and avoid overloading their servers, we incorporated waiting times between requests. The data collected included the posts’ links, titles, bodies, and tags.
To effectively capture posts related to “LLaMA”, we employed two distinct methods:
Method 1: Keyword Search. We conducted a search on Stack Overflow using the keyword “LLaMA” (queried from the URL https://stackoverflow.com/search?tab=newest&q=LLaMA&searchOn=3). This search yielded 2,405 posts, which we categorized as follows:
• Title Group (644 posts): Posts where “LLaMA” appeared in the title.
• Body Group (1,761 posts): Posts where “LLaMA” appeared in the body. However, after manual inspection, many of these posts were deemed irrelevant and thus excluded from further analysis.
Method 2: Tag Search (770 posts). We also searched for posts tagged with “LLaMA” on Stack Overflow (queried from the URL https://stackoverflow.com/questions/tagged/LLaMA?tab=Newest). The resulting 770 posts were compiled into a separate group called the Tag Group.
3.2. Data Preprocessing
Data preprocessing was essential to ensure the relevance and quality of the data used in our analysis. The following steps were undertaken to refine the data:
Step 1: Tag Separation. The tags in the Tag Group were initially compiled as a single string. To analyze the tags associated with each post more precisely, we separated them into individual tags, enabling more effective identification and analysis.
Step 2: Duplicate Removal. During preprocessing, we identified overlaps between the Title Group and Tag Group, as some posts appeared in both groups due to being tagged
with “LLaMA.” Additionally, we detected duplicate entries with identical post links and titles. These redundancies were removed, resulting in a refined dataset of 473 posts comprising 395 posts tagged with “LLaMA” and 78 posts without the tag.
3.3. Dataset Characteristics
After data collection and preprocessing, our final dataset consisted of 473 posts, all centered on LLaMA-related topics. These posts cover a range of issues, questions, and discussions about LLaMA, including configuration, usage, and challenges.
For instance, a typical post in our dataset may include a query about fine-tuning the LLaMA model:
“How do I fine-tune the LLaMA model on a custom dataset? I’m facing memory issues during training and could use some advice on optimizing performance.”
Another example might address integration issues:
“I’m trying to integrate LLaMA with an existing API but keep encountering errors during the authentication process. Has anyone faced similar issues?”
These examples illustrate the types of discussions that form the basis of our subsequent analysis.
3.4. Data Analysis
Using the cleansed datasets, we analyzed the topics of discussion related to LLaMA on Stack Overflow to address our research questions:
RQ1: What are the common topics discussed regarding LLaMA? We manually classified the titles and bodies of the posts to identify common topics. To ensure thoroughness, the first author initially skimmed through all posts to get a sense of the themes and formulated the six categories as a preliminary structure. Then the first and second authors independently reviewed all posts, categorizing them into six groups: Model Configuration and Fine-Tuning, Error Handling and Debugging, Installation and Setup Issues, Integration and API Usage, Runtime and Performance Issues, and Model Deployment and Hosting. The first author established these six groups before the manual classification, during the data collection and data preprocessing steps. One post could fall into multiple categories. Any disagreements were resolved through discussion until a consensus was reached.
RQ2: What related topics and technologies are associated with LLaMA? We examined the tags associated with “LLaMA” to identify related topics and technologies. The co-occurrence of these tags with “LLaMA” shows the broader technological ecosystem and application areas linked to LLaMA.
4. Results
This section presents the findings from our analysis of the discussions related to the LLaMA model on Stack Overflow and the answers to our research questions. We address the research questions (RQ1 and RQ2) through a detailed examination of the collected and cleansed datasets.
4.1. Answering RQ1
To answer RQ1, we manually categorized the posts into six distinct categories based on the nature of the issues discussed. To assess the reliability of the manual classification, we calculated the inter-rater reliability using Cohen’s Kappa statistic. The Kappa score was 0.883, indicating almost perfect agreement between the two authors. The categorization helped us to identify the most common themes in the developer community’s conversations about LLaMA. Table 1 provides a summary of the categories and the number of posts that relate to each category.
From our analysis, the largest share of discussions focuses on Model Configuration and Fine-Tuning, with 135 posts, making it the most frequently discussed topic. This suggests that many developers are struggling with configuring and fine-tuning LLaMA models to meet specific needs. Posts in this category often mention challenges such as adjusting hyperparameters, loading pre-trained models,
and optimizing models for specific tasks or datasets. The prevalence of this category suggests that LLaMA’s flexibility and complexity in configuration require careful attention and often lead to challenges that developers seek to overcome. Figure 3 shows a Stack Overflow post titled “Chat with spreadsheet using Meta Llama (Llama 2 13B Chat HF),” categorized under the Model Configuration and Fine-Tuning category. In this post, the questioner is facing the problem of using LLaMA for querying spreadsheet data.
Table 1: Categories of LLaMA Discussion on Stack Overflow
Category | Number of Posts
Model Configuration and Fine-Tuning | 135
Error Handling and Debugging | 110
Installation and Setup Issues | 91
Integration and API Usage | 86
Runtime and Performance Issues | 73
Model Deployment and Hosting | 24
Total | 519
Figure 3: Example of a Model Configuration and Fine-Tuning post (Post ID 76880690)
Error Handling and Debugging accounts for 110 posts. This category includes posts where developers encountered errors during the use of LLaMA and sought solutions to resolve these issues. Common topics in this category involve troubleshooting runtime errors, resolving compatibility issues with other libraries, and debugging scripts that fail to execute as expected. The prevalence of posts in this category underscores the need for robust debugging tools and clear documentation to help developers efficiently resolve issues. Figure 4 depicts a Stack Overflow post titled “How to debug the Llama 2 inference command with VSCode,” which is categorized under “Error Handling and Debugging.” In this post, the questioner asks about configuring Visual Studio Code to debug the Llama 2 inference script.
Installation and Setup Issues is another prominent category, comprising 91 posts. This category covers problems encountered during the initial stages of working with LLaMA, including installation errors, environment configuration challenges, and difficulties in setting up dependencies. The high number of posts in this category indicates that getting started with LLaMA can be particularly challenging, especially for users who are new to the model or unfamiliar with the broader ecosystem of tools it integrates with. Figure 5 shows a Stack Overflow post titled “Cuda 12.2 and issue with bitsandbytes package installation” categorized under “Installation and Setup Issues.” In this post, the developer is facing an issue with running Llama 2 on Google Colab and asks for help.
Integration and API Usage, with 86 posts, reflects discussions on how to connect LLaMA with other systems, particularly through APIs. Developers often seek guidance on integrating LLaMA into existing workflows, leveraging its capabilities alongside other tools, and addressing API-related challenges. These discussions highlight the importance of seamless integration between LLaMA and other technologies, as well as the need for clear guidelines on API usage.
Runtime and Performance Issues, comprising 73 posts, focuses on challenges that developers face during the execution of LLaMA models. This includes discussions on optimizing model performance, managing resource consumption, and addressing latency issues. Posts in this category often highlight the need for efficient execution of LLaMA models, especially in production environments where performance is critical.
Model Deployment and Hosting, with 24 posts, is the least discussed category. Posts here focus on deploying LLaMA models into production, managing model versions, and hosting models on different platforms. The relatively low number of posts in this category might suggest that deployment is a more advanced stage of working with LLaMA, which fewer users have reached, or that deployment-related issues are less frequent or already well-documented within the community.
Overall, the distribution of posts across these categories provides valuable insights into the areas where LLaMA users are most likely to encounter challenges. It also highlights the importance of comprehensive support and resources in the areas of model configuration, error handling, and integration.
Figure 4: Example of the Error Handling and Debugging post (Post ID 77421713)
Figure 5: Example of Installation and Setup Issues post (Post ID 78194505)
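The inter-rater agreement reported in Section 4.1 (Cohen’s Kappa of 0.883) can be illustrated with a minimal Python sketch. It assumes that each rater assigns exactly one category per post, which simplifies the study’s setup (a post could fall into multiple categories, and the paper does not state how such posts entered the Kappa calculation). The function name cohen_kappa and the six example labels are hypothetical and are not data from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who each assign one label per item."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, derived from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined here, and returning 1.0 is a convention of this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for six posts (illustrative only, not study data).
rater_1 = ["config", "error", "install", "config", "error", "install"]
rater_2 = ["config", "error", "install", "config", "install", "install"]

print(round(cohen_kappa(rater_1, rater_2), 3))  # prints 0.75
```

In this toy example the raters agree on five of six posts (observed agreement 5/6), chance agreement is 1/3, and the resulting Kappa is 0.75. Established libraries offer equivalent functions, which would be preferable to a hand-written version in practice.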
4.2. Answering RQ2
To address RQ2, we examined the co-occurrence of tags in posts discussing LLaMA. By analyzing these tags, we aimed to identify related topics and technologies that are commonly mentioned alongside LLaMA on Stack Overflow. Table 2 summarizes the frequency of the most common co-occurring tags.
Table 2: Top Co-Occurring Tags with llama Tag on Stack Overflow
Tag | Occurrences
llama | 398
large-language-model | 201
python | 184
huggingface-transformers | 109
langchain | 77
pytorch | 70
huggingface | 66
llama-index | 40
nlp | 39
artificial-intelligence | 34
fine-tuning | 28
openai-api | 26
machine-learning | 25
ollama | 23
llamacpp | 22
llama-cpp-python | 22
llama3 | 20
python-3.x | 20
amazon-sagemaker | 19
chatbot | 14
gpu | 14
amazon-web-services | 12
Others | 356
Total | 1,819
The analysis revealed that the large-language-model tag was the most frequently co-occurring tag with LLaMA, appearing in 201 posts. This suggests that discussions around LLaMA are often framed within the broader context of large language models, indicating that developers are considering LLaMA alongside other major models in this category. The frequent mention of python (184 posts) and huggingface-transformers (109 posts) indicates that developers are actively using Python-based tools and libraries, particularly Hugging Face’s Transformers library, to work with LLaMA. This reflects LLaMA’s integration into the Python ecosystem and its compatibility with popular machine-learning frameworks.
The co-occurrence of tags like langchain (77 posts) and pytorch (70 posts) further supports the observation that LLaMA is frequently used in conjunction with other machine-learning tools. LangChain, in particular, is a framework designed for building applications with LLMs, suggesting that LLaMA users are developing complex workflows that involve multiple LLMs.
Notably, the openai-api tag appeared in 26 posts, indicating a significant interest in interoperability between LLaMA and OpenAI’s models. The posts in this category reveal several common themes:
1. Interoperability Between LLaMA and OpenAI Models: Many posts discuss how to integrate or migrate between LLaMA models and OpenAI APIs. For instance, questions related to migrating from ChatGPT to Llama 2 or using different LlamaIndex chat engine modes with an OpenAI key suggest that users are exploring how to use both systems together or comparing their functionalities.
2. LangChain and LLaMA: Several posts mention LangChain in conjunction with LLaMA. LangChain is a framework for building applications with LLMs, and the discussions around using it with LLaMA suggest that users are working on sophisticated workflows involving multiple language models. This highlights LLaMA’s role in the broader landscape of language model applications.
3. Fine-Tuning and Model Performance: With the fine-tuning tag appearing in 28 posts, this category reflects discussions around optimizing LLaMA models. Posts such as “LLaMA Index training my own model gives poor results” indicate challenges in fine-tuning LLaMA models. Users seek advice on improving model performance, particularly in fine-tuning and optimizing models for specific tasks.
4. Vector Databases and RetrievalQA: Discussions in this area involve using LLaMA models with vector databases and RetrievalQA. Users are focusing on effectively retrieving documents or managing storage when integrating LLaMA with OpenAI’s API, reflecting the complexity of tasks users are undertaking.
5. Computational Resources: Questions related to hardware usage, such as issues with running LLaMA models on CPUs or optimizing GPU usage, highlight developers’ concerns about the computational demands of LLaMA models. Tags such as gpu and amazon-sagemaker appear alongside discussions focused on resource optimization.
These findings illustrate that LLaMA is part of a larger ecosystem of tools and technologies, with significant interest in how it can be integrated with or compared to other models, particularly those from OpenAI. The discussions also underscore the importance of effective model management, performance optimization, and resource utilization when working with LLaMA.
4.3. Threats to Validity
Several threats to validity may impact the findings of this
LLaMA AI. This assumption may have resulted in the inclusion of irrelevant or off-topic content. We mitigated this risk by performing a manual verification of 500 posts to ensure relevance, though some less obvious irrelevant content might still remain. Additionally, our reliance on manual classification introduces the risk of human error and bias. To address this, two authors independently classified the posts, and any discrepancies were resolved through discussion to increase consistency and reduce subjectivity. However, biases inherent in manual processes may still exist, and the absence of automated classification tools may have limited the scalability of the analysis.
Furthermore, the data collection was conducted only up until July 22, 2024, which excludes newer posts. As the field of large language models (LLMs) evolves rapidly, this limitation may have prevented us from capturing recent trends or emerging challenges, potentially affecting the completeness and timeliness of our analysis.
External Validity: The findings are based solely on Stack Overflow (SO) posts with the keyword “LLaMA” in the titles or tags, which may limit the generalizability of our results to other technical Q&A platforms such as GitHub, Reddit, or specialized forums where different types of discussions and more complex technical issues may be addressed. By focusing exclusively on SO, we may have missed richer, more nuanced developer challenges that could provide a broader understanding of LLaMA adoption across different communities.
5. Implications
The findings from this study provide valuable insights into the quality of Meta’s LLaMA model and how the developer community engages with it on Stack Overflow, particularly in terms of overcoming technical challenges. The analysis reveals that discussions predominantly focus on issues such as configuring, fine-tuning, and integrating LLaMA into various applications. This highlights the model’s flexibility but also points to its complexity, underscoring the need for improved documentation, resources, and tools.
One key implication is the necessity for enhanced community support and resources for model configuration and fine-tuning. The frequency of posts on these topics suggests that many developers, especially those without advanced expertise in machine learning, encounter significant difficulties. By improving documentation and offering more user-friendly tools, Meta could lower the barrier to entry for a wider audience, leading to broader adoption of LLaMA. This could also include the development of community-driven forums, FAQs, or official support channels dedicated to troubleshooting configuration and fine-tuning issues.
Another important implication is the need to prioritize seamless integration with existing machine-learning ecosystems. The co-occurrence analysis shows that LLaMA is frequently used in conjunction with popular frameworks like Hugging Face’s Transformers, PyTorch, and LangChain, particularly in Python environments. This suggests that future iterations of LLaMA should focus on making integration with these frameworks more straightforward and efficient, potentially through more robust APIs, pre-built connectors, or better interoperability guidelines. Ensuring compatibility with widely-used tools will be crucial in positioning LLaMA as a go-to solution for developers working on real-world
applications. Finally, the relatively low number of posts
study. Internal Validity: One potential threat is the assump-
discussing the deployment and hosting of LLaMA models
tion that all posts in the dataset were relevant to Meta’s
suggests that this is still an emerging area. However, as more developers move toward deploying LLaMA models in production environments, there will likely be an increasing demand for comprehensive deployment tools, best practices, and infrastructure support.

6. Conclusion and Future Work
In conclusion, this preliminary study provides a detailed investigation of the quality, the challenges, and related topics in the Stack Overflow community’s discussions about LLaMA. By understanding these areas, Meta and the broader developer community can better support the use of LLaMA, ultimately driving innovation in LLM development.
This study provides valuable insights into the challenges developers face when adopting LLaMA, based on Stack Overflow discussions. However, several areas for future research could significantly enrich the findings and address the limitations identified in this study. First, expanding the dataset to include posts beyond July 2024 will help capture evolving trends as LLaMA and other large language models (LLMs) continue to develop. Additionally, incorporating data from other platforms such as GitHub Issues, Reddit, and developer forums could provide a broader perspective on LLaMA’s usage, especially on more complex technical problems and nuanced discussions that may not be captured on Stack Overflow alone. Comparing LLaMA to other LLMs, such as ChatGPT or Claude, would also provide valuable insights, allowing researchers to understand LLaMA’s challenges in the broader landscape and better justify its focus.
Furthermore, future research should enhance the methodology by employing a more rigorous approach to data filtering and analysis. Pre-processing the data to exclude trivial questions and focusing on more substantial challenges would yield more meaningful insights. Using established qualitative coding frameworks for topic classification would further improve the transparency and validity of the analysis. Another promising direction is incorporating sentiment analysis to understand community attitudes toward LLaMA. By analyzing the tone of discussions across platforms, researchers could uncover whether developers’ experiences with LLaMA are generally positive, negative, or neutral, offering Meta and the developer community actionable feedback for improving the tool.
Additionally, complementing the analysis with user studies, such as surveys or interviews, could provide a deeper understanding of the practical challenges faced by developers using LLaMA in real-world scenarios. Exploring specific use cases where LLaMA is integrated into different application domains, such as natural language processing (NLP) or enterprise applications, could reveal unique challenges and benefits in various contexts. Finally, investigating advanced stages of LLaMA adoption, particularly in production environments, would help identify issues related to deployment and model hosting, offering a more complete picture of LLaMA’s practical applications and limitations. By addressing these areas, future research will contribute to a more comprehensive understanding of LLaMA’s role within the LLM ecosystem, driving more effective support for developers and fostering broader adoption of open-source LLMs.

7. ACKNOWLEDGEMENT
This research project was supported by the Faculty of ICT, Mahidol University.

References
[1] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).
[2] S. Sadeghi, A. Bui, A. Forooghi, J. Lu, A. Ngom, Can large language models understand molecules?, 2024. URL: https://arxiv.org/abs/2402.00024. arXiv:2402.00024.
[3] K. Wangsa, S. Karim, E. Gide, M. Elkhodr, A systematic review and comprehensive analysis of pioneering ai chatbot models from education to healthcare: Chatgpt, bard, llama, ernie and grok, Future Internet 16 (2024). URL: https://www.mdpi.com/1999-5903/16/7/219. doi:10.3390/fi16070219.
[4] J. Son, B. Kim, Trend Analysis of Large Language Models through a Developer Community: A Focus on Stack Overflow, Information 14 (2023).
[5] A. Hörnemalm, O. Norberg, T. Mejtoft, ChatGPT as a Software Development Tool The Future of Development, Master’s thesis, Umeå University, Department of Applied Physics and Electronics, 2023.
[6] L. Zhong, Z. Wang, Can llm replace stack overflow? a study on robustness and reliability of large language model code generation, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 2024, pp. 21841–21849.
[7] K. Jin, C.-Y. Wang, H. V. Pham, H. Hemmati, Can ChatGPT Support Developers? An Empirical Evaluation of Large Language Models for Code Generation, in: Proceedings of the 21st International Conference on Mining Software Repositories, MSR ’24, 2024, pp. 167–171.
[8] A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, J. M. Zhang, Large Language Models for Software Engineering: Survey and Open Problems, in: ICSE-FoSE’23, 2023, pp. 31–53.
[9] A. Buscemi, D. Proverbio, Chatgpt vs gemini vs llama on multilingual sentiment analysis, 2024. URL: https://arxiv.org/abs/2402.01715. arXiv:2402.01715.
[10] A. Salinas, P. Shah, Y. Huang, R. McCormack, F. Morstatter, The unequal opportunities of large language models: Examining demographic biases in job recommendations by chatgpt and llama, in: Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, EAAMO ’23, Association for Computing Machinery, New York, NY, USA, 2023. URL: https://doi.org/10.1145/3617694.3623257. doi:10.1145/3617694.3623257.
[11] L. Da Silva, J. Samhi, F. Khomh, Chatgpt vs llama: Impact, reliability, and challenges in stack overflow discussions, arXiv preprint arXiv:2402.08801 (2024).