=Paper=
{{Paper
|id=Vol-3864/quasoq-2024-paper-05
|storemode=property
|title=Challenges in Adopting LLaMA: An Empirical Study of Discussions on Stack Overflow
|pdfUrl=https://ceur-ws.org/Vol-3864/quasoq-2024-paper-05.pdf
|volume=Vol-3864
|authors=Ramita Deeprom,Shiyu Yang,Yoshiki Higo,Morakot Choetkiertikul,Chaiyong Ragkhitwetsagul
|dblpUrl=https://dblp.org/rec/conf/apsec/DeepromYHCR24
}}
==Challenges in Adopting LLaMA: An Empirical Study of Discussions on Stack Overflow==
<pdf width="1500px">https://ceur-ws.org/Vol-3864/quasoq-2024-paper-05.pdf</pdf>
<pre>
                         Challenges in Adopting LLaMA: An Empirical Study of
                         Discussions on Stack Overflow
                         Ramita Deeprom¹,*, Shiyu Yang², Yoshiki Higo², Morakot Choetkiertikul¹ and Chaiyong Ragkhitwetsagul¹
                         ¹ Faculty of Information and Communication Technology, Mahidol University, 999 Phuttamonthon 4 Road, Salaya, Nakhon Pathom 73170, Thailand
                         ² Graduate School of Information Science and Technology, Osaka University, 1-5 Yamadaoka, Suita, Osaka 565-0871, Japan


                                           Abstract
                                           LLaMA (Large Language Model Meta AI) has quickly gained traction among developers due to its wide-ranging applications and its
                                           ability to be integrated into software projects. As interest in LLaMA grows, discussions around it have surged on platforms like
                                           Stack Overflow. The developer community, with its collaborative nature, serves as a valuable source for studying LLaMA’s quality, its
                                           emerging trends, and insights into its usage. Despite this growing attention, there has been no comprehensive study examining how the
                                           community interacts with and discusses LLaMA. This study addresses that gap by exploring conversations on Stack Overflow related to
                                           LLaMA and its quality, with the objective of identifying key themes and recurring patterns in these discussions. We systematically
                                           collected and analyzed 473 posts from Stack Overflow that contained the keyword “LLaMA” or were tagged accordingly. The analysis
                                           revealed that prominent topics of discussion include model configuration, error handling, and integration with other technologies.
                                           Furthermore, we identified frequent co-occurring tags, underscoring LLaMA’s integration within the larger ecosystem of large language
                                           models and its interoperability with widely used frameworks, such as Python and Hugging Face Transformers. The findings highlight
                                           the complexity of working with LLaMA, especially in model configuration and fine-tuning, indicating a need for better resources,
                                           documentation, and community support. The study also suggests that future development should prioritize interoperability with
                                           popular machine-learning frameworks to improve the LLM’s quality and to strengthen LLaMA’s role in the AI ecosystem.

                                           Keywords
                                           LLaMA, Stack Overflow, Large Language Models’ Quality


QuASoQ 2024: 12th International Workshop on Quantitative Approaches to Software Quality,
December 03, 2024, Chongqing, China
* Corresponding author.
Email: ramita.dep@student.mahidol.ac.th (R. Deeprom); yangsy@ist.osaka-u.ac.jp (S. Yang);
higo@ist.osaka-u.ac.jp (Y. Higo); morakot.cho@mahidol.ac.th (M. Choetkiertikul);
chaiyong.rag@mahidol.ac.th (C. Ragkhitwetsagul)
Web: https://ysy-dlg.github.io/MyHomePage (S. Yang); https://sites.google.com/view/yhigo/home (Y. Higo);
https://morakotch.wordpress.com/ (M. Choetkiertikul); https://cragkhit.github.io/ (C. Ragkhitwetsagul)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

The rapid advancements in artificial intelligence (AI) have revolutionized the field of technology,
leading to the creation of powerful large language models (LLMs) that are transforming how
developers and organizations approach problem-solving. One such model is Meta’s LLaMA¹, an
open-source LLM that has garnered substantial attention from the developer community [1]. Unlike
many proprietary models, LLaMA offers developers the flexibility to fine-tune and customize the
model for specific use cases, making it an attractive alternative for those who require more
control and adaptability in their applications [1].

Recent studies have demonstrated LLaMA’s superior performance in specific domain tasks, such as
cheminformatics, where it has outperformed models like ChatGPT in tasks such as SMILES embeddings
for predicting molecular properties and drug-drug interactions (DDI) [2]. This suggests that LLaMA
is particularly effective in tasks that demand a high degree of precision and domain-specific
expertise, setting it apart from other LLMs. While models like ChatGPT, Bard, and Ernie may offer
unique features such as real-time web access or higher computational efficiency, LLaMA stands out
by providing a well-rounded balance across various criteria, making it suitable for a broader
range of applications [3].

The growing interest in LLaMA is particularly evident on platforms like Stack Overflow (SO)², an
online community where developers ask questions, share knowledge, and provide solutions related to
software development and technology. SO has become one of the most widely used platforms for
developers to collaborate, troubleshoot, and learn from each other, making it a rich source of
information about real-world challenges and practical applications of various technologies.
Studying SO is essential because it reflects the collective experiences and expertise of a global
community of developers, providing valuable insights into the quality, common issues, and trends
that arise with new technologies like LLaMA. By examining the discussions on SO, we can better
understand not only the key themes and challenges developers face with LLaMA but also the broader
context of its integration and adoption in various fields. This understanding is critical for
identifying areas where additional support, documentation, or tools might be needed to improve the
developer experience and further promote the effective use of LLaMA.

This study aims to address an initial gap by conducting an empirical analysis of Stack Overflow
posts tagged with LLaMA to identify the predominant discussion topics related to its quality and
adoption, and associated technologies. By employing keyword frequency analysis and categorizing the
posts, this study seeks to answer two key research questions: (1) What are the main topics of
discussion regarding LLaMA on Stack Overflow? and (2) What related themes emerge in these
discussions? Through this initial analysis, we aim to provide early insights into the specific
challenges developers face, the solutions they seek, and the broader implications for LLaMA’s role
within the AI ecosystem. The findings from this research study will serve as a foundation for a
more comprehensive future study, contributing valuable insights to both practitioners and
researchers as we further our understanding of LLaMA’s use and integration within diverse
technical environments.

The structure of this paper is as follows. Section 2 provides the background and related work,
detailing prior research on the adoption of large language models (LLMs) such as LLaMA and their
application in real-world scenarios. The methodology employed in our research, including data
collection and preprocessing techniques, is explained in Section 3. Section 4 presents the results
of our empirical study, focusing on the analysis of Stack Overflow discussions to answer the
research questions posed in this study. We then discuss the implications of our findings in
Section 5, where we highlight the key challenges faced by developers when working with LLaMA and
suggest potential improvements for future development. Finally, Section 6 concludes the paper and
outlines potential avenues for future research, such as expanding the dataset and exploring more
advanced stages of LLaMA adoption.

¹ https://github.com/meta-llama/llama
² https://stackoverflow.com/

2. Background and Related Work

The rapid adoption of generative AI, particularly large language models (LLMs), has sparked
significant interest in understanding how users are integrating these tools into their workflows.
Previous research shows that many professionals increasingly rely on generative AI, such as ChatGPT
and LLaMA, to solve problems traditionally addressed on platforms like Stack Overflow (SO) [4, 5].
This shift suggests a change in the problem-solving paradigm, where AI-generated solutions are
becoming a first resort for many developers, streamlining the troubleshooting process and
improving efficiency [4]. However, despite the growing reliance on AI, recent studies indicate that
not all users are fully satisfied with AI-generated responses. Some developers still face
challenges, particularly with complex technical issues, prompting them to seek human-based
community support on platforms like SO [6, 5]. This highlights the limitations of AI models in
delivering contextually accurate and reliable answers for more nuanced problems [7, 5].

LLaMA, an open-source LLM created by Meta, offers notable advantages that contribute to its rising
popularity within the developer community. Released to the public in February 2023, with LLaMA 3.1
debuting in July 2024, the model has garnered over 300 million downloads globally, underscoring its
widespread adoption [8]. Compared to ChatGPT, LLaMA is perceived as more complex to install and
configure, yet its appeal lies in its ability to provide fine-tuned, context-specific outputs,
making it particularly attractive to developers who require precision and control [8, 3].
Furthermore, LLaMA’s enhanced security features and the ability to be hosted internally within
organizations without the risk of leaking sensitive information make it a strong contender for
enterprise use cases [3]. These characteristics reduce the risk of biased outputs, which is often a
concern for beginners relying too heavily on AI-generated responses [3]. The model’s open-source
nature also allows for greater flexibility in integration and customization, offering experienced
developers a robust tool for specialized applications [3, 2].

Studies have highlighted that LLaMA excels in certain domain-specific tasks, such as
cheminformatics, where it outperforms ChatGPT in Simplified Molecular Input Line Entry System
(SMILES) embeddings for molecular property and drug-drug interaction (DDI) predictions [2]. This
superior performance suggests that LLaMA is well-suited for tasks that require a high degree of
precision and the handling of specific domain data, further distinguishing it from other LLMs.
Similarly, comparative analyses have shown that while ChatGPT and other models like Bard and Ernie
offer advantages in certain areas, such as real-time internet access or computational efficiency,
LLaMA provides balanced performance across multiple criteria, making it a versatile tool for
various applications [3].

Moreover, the performance of Llama 2 has been noted to exhibit minimal variation across different
languages, offering consistency in sentiment analysis tasks. However, this consistency sometimes
comes at the cost of skewing ratings towards positive sentiment, even in scenarios where more
nuanced interpretations are required [9]. Furthermore, recent studies on job recommendations
generated by LLaMA reveal both strengths and limitations. While LLaMA suggests a wider variety of
professions compared to ChatGPT, its recommendations often include impractical or nonsensical
roles, reflecting a trade-off between diversity and practicality [10]. This indicates the need for
improved prompt engineering and bias mitigation in LLM applications to ensure fairer and more
relevant outcomes across diverse user groups.

Several studies have leveraged Stack Overflow data to analyze trends within the developer
community, providing insights into quality, common challenges, emerging technologies, and evolving
developer needs. Silva et al. [11] report that ChatGPT has significantly impacted SO, offering
fast, human-like responses that have raised questions about the platform’s future in the AI era.
The study noted a decline in overall SO activity, though some communities remain active. Both
models excel at addressing general programming queries but struggle with specific frameworks and
libraries, leading developers to return to SO when LLMs fall short. Similarly, Zhong et al. [6]
developed the RobustAPI dataset, featuring 1,208 coding questions from SO related to 18 Java APIs.
Their study revealed that even advanced models like GPT-4 produced API misuses in 62% of the
generated code, posing risks when applied to real-world software development.

Nonetheless, no study has investigated the quality of LLaMA and its adoption in practice. This
study fills that gap by analyzing the discussions related to LLaMA on SO.

3. Methodology

As shown in Figure 1, a motivating example is a Stack Overflow post where a user inquires about
installing the LLaMA-cpp-python package. This post has garnered 38,975 views (at the time of
writing), illustrating the widespread interest in LLaMA but also highlighting that developers
frequently encounter challenges requiring external help. Despite its growing popularity, the
installation and configuration of LLaMA packages remain common stumbling blocks.

In light of this, our research focuses on examining the discussions surrounding LLaMA on Stack
Overflow. By analyzing these interactions, we aim to uncover the most prevalent issues and
limitations faced by developers working with LLaMA. This study not only seeks to identify key
challenges but also offers valuable insights for both novice users looking to get started with
LLaMA and experienced developers seeking to optimize and enhance their implementations.
Ultimately, our findings will contribute to improving the support and resources available to the LLaMA


Figure 1: A LLaMA question of Stack Overflow (Post ID 77267346)
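
A Stack Overflow question can carry several tags in addition to the LLaMA tag. The tag
co-occurrence count behind RQ2 could be sketched as follows. This is a minimal illustration, not
the authors' tooling: the function name co_occurring_tags, the anchor tag spelling "llama", and
the example tag lists are all assumptions made for the sketch.

```python
from collections import Counter


def co_occurring_tags(posts_tags, anchor="llama"):
    """Count tags that appear on the same post as `anchor` (case-insensitive)."""
    counts = Counter()
    for tags in posts_tags:
        # Normalize case and whitespace, and drop empty entries.
        normalized = {t.strip().lower() for t in tags if t.strip()}
        if anchor in normalized:
            # Count every other tag on this post once.
            counts.update(normalized - {anchor})
    return counts


# Hypothetical tag lists, not taken from the paper's dataset.
example_posts = [
    ["llama", "python", "huggingface-transformers"],
    ["llama", "python"],
    ["python", "pandas"],      # no anchor tag, so this post is ignored
    ["LLaMA", "langchain"],    # matched despite different casing
]

tag_counts = co_occurring_tags(example_posts)
for tag, count in sorted(tag_counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(tag, count)
```

Sorting by descending count and then by tag name keeps the printed ranking stable when several
tags are tied.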


community, facilitating smoother adoption and integration of the model into various workflows.

We ask the following research questions in this study.

        1. RQ1: What are the topics of discussion about LLaMA on Stack Overflow? We aim to
           identify and categorize the topics of discussion related to “LLaMA” on Stack Overflow.
           This is to determine the most common themes and issues raised by the developer
           community concerning LLaMA.
        2. RQ2: What are the related topics when discussing LLaMA on Stack Overflow? The second
           research question focuses on identifying related tags co-occurring with the LLaMA tag
           on Stack Overflow. This is to find other relevant topics or challenges that LLaMA users
           may face or need to study.

This section details the steps undertaken to address the two research questions posed earlier. As
illustrated in Figure 2, our methodology involves three key phases: data collection,
preprocessing, and analysis. Each phase is designed to ensure a systematic and thorough
examination of Stack Overflow discussions related to LLaMA. In the data collection phase, we
gathered relevant posts from Stack Overflow, ensuring a representative sample of developer
interactions. This was followed by the preprocessing phase, where we cleansed and refined the data
to ensure its quality and relevance for analysis. Finally, the analysis phase involved
categorizing the posts and performing keyword frequency analysis to uncover common themes and
patterns.

3.1. Data Collection

Our study is based on data collected directly from Stack Overflow, particularly focusing on posts
related to LLaMA, the generative AI model from Meta. Initially, we considered using the Stack
Overflow public data dump files, including Posts.xml and Tags.xml³. However, after downloading and
inspecting these files, we found that they did not contain recent posts relevant to our study,
particularly those involving technologies like LLaMA, likely due to the release of LLaMA being
more recent than the last update of the data dump.

As a result, we adopted a more direct and up-to-date data collection approach. We utilized a web
scraping tool⁴ to scrape posts directly from Stack Overflow. The scraping process was conducted on
July 22, 2024. To comply with Stack Overflow’s usage policies and avoid overloading their servers,
we incorporated waiting times between requests. The data collected included the posts’ links,
titles, bodies, and tags.

To effectively capture posts related to “LLaMA”, we employed two distinct methods:

Method 1: Keyword Search. We conducted a search on Stack Overflow using the keyword “LLaMA”⁵. This
search yielded 2,405 posts, which we categorized as follows:

      • Title Group (644 posts): Posts where “LLaMA” appeared in the title.
      • Body Group (1,761 posts): Posts where “LLaMA” appeared in the body. However, after manual
        inspection, many of these posts were deemed irrelevant and thus excluded from further
        analysis.

Method 2: Tag Search (770 posts). We also searched for posts tagged with “LLaMA” on Stack
Overflow⁶. This search resulted in 770 posts, which were compiled into a separate group called the
Tag Group.

³ https://archive.org/details/stackexchange
⁴ Web Scraper version 1.87.6 (available at: https://webscraper.io/)
⁵ We queried from the URL https://stackoverflow.com/search?tab=newest&q=LLaMA&searchOn=3
⁶ We queried from the URL https://stackoverflow.com/questions/tagged/LLaMA?tab=Newest

3.2. Data Preprocessing

Data preprocessing was essential to ensure the relevance and quality of the data used in our
analysis. The following steps were undertaken to refine the data:

Step 1: Tag Separation. The tags in the Tag Group were initially compiled as a single string. To
analyze the tags associated with each post more precisely, we separated them into individual tags,
enabling more effective identification and analysis.

Step 2: Duplicate Removal. During preprocessing, we identified overlaps between the Title Group
and Tag Group, as some posts appeared in both groups due to being tagged


Figure 2: The Experimental Procedure
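
The preprocessing phase of this procedure (tag separation and duplicate removal) could be sketched
as follows. This is an illustrative sketch under stated assumptions, not the authors' scripts: it
assumes each scraped post is a dictionary with "link", "title", and "tags" keys, that the tags
arrive as one whitespace-separated string, and that duplicates are posts sharing both link and
title. The function names and the example.com links are hypothetical.

```python
def split_tags(tag_string):
    """Split a scraped tag string into individual lowercase tags.

    Assumption: tags are separated by whitespace; adapt this for other delimiters.
    """
    return [tag.lower() for tag in tag_string.split()]


def merge_and_deduplicate(title_group, tag_group):
    """Merge both groups and keep one copy of each (link, title) pair."""
    seen = set()
    merged = []
    # The Tag Group comes first so that its copy of an overlapping post is the one kept.
    for post in tag_group + title_group:
        key = (post["link"], post["title"])
        if key in seen:
            continue
        seen.add(key)
        cleaned = dict(post)  # copy so the input records stay unchanged
        cleaned["tags"] = split_tags(post.get("tags", ""))
        merged.append(cleaned)
    return merged


# Hypothetical records, not taken from the paper's dataset.
tag_group = [
    {"link": "https://example.com/q/1", "title": "Install issue", "tags": "llama python"},
    {"link": "https://example.com/q/2", "title": "Fine-tuning OOM", "tags": "LLaMA pytorch"},
]
title_group = [
    {"link": "https://example.com/q/1", "title": "Install issue", "tags": "llama python"},
    {"link": "https://example.com/q/3", "title": "LLaMA with an API", "tags": "python rest"},
]

dataset = merge_and_deduplicate(title_group, tag_group)
tagged = [post for post in dataset if "llama" in post["tags"]]
print(len(dataset), len(tagged), len(dataset) - len(tagged))
```

The final line mirrors the kind of breakdown reported for the refined dataset: total posts, posts
carrying the tag, and posts without it.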


with “LLaMA.” Additionally, we detected duplicate entries with identical post links and titles.
These redundancies were removed, resulting in a refined dataset of 473 posts comprising 395 posts
tagged with “LLaMA” and 78 posts without the tag.

3.3. Dataset Characteristics

After data collection and preprocessing, our final dataset consisted of 473 posts, all centered on
LLaMA-related topics. These posts cover a range of issues, questions, and discussions about LLaMA,
including configuration, usage, and challenges.

For instance, a typical post in our dataset may include a query about fine-tuning the LLaMA model:

       “How do I fine-tune the LLaMA model on a custom dataset? I’m facing memory issues
       during training and could use some advice on optimizing performance.”

Another example might address integration issues:

       “I’m trying to integrate LLaMA with an existing API but keep encountering errors during
       the authentication process. Has anyone faced similar issues?”

These examples illustrate the types of discussions that form the basis of our subsequent analysis.

3.4. Data Analysis

Using the cleansed datasets, we analyzed the topics of discussion related to LLaMA on Stack
Overflow to address our research questions:

RQ1: What are the common topics discussed regarding LLaMA? We manually classified the titles and
bodies of the posts to identify common topics. To ensure thoroughness, the first author initially
skimmed through all posts to get a sense of the themes and formulated the six categories as a
preliminary structure. Then the first and second authors independently reviewed all posts,
categorizing them into six groups: Model Configuration and Fine-Tuning, Error Handling and
Debugging, Installation and Setup Issues, Integration and API Usage, Runtime and Performance
Issues, and Model Deployment and Hosting. These six groups were established by the first author
before the manual classification, during the data collection and data preprocessing steps. One
post could fall into multiple categories. Any disagreements were resolved through discussion until
a consensus was reached.

RQ2: What related topics and technologies are associated with LLaMA? We examined the tags
associated with “LLaMA” to identify related topics and technologies. The co-occurrence of these
tags with “LLaMA” shows the broader technological ecosystem and application areas linked to LLaMA.

4. Results

This section presents the findings from our analysis of the discussions related to the LLaMA model
on Stack Overflow and the answers to our research questions. We address the research questions
(RQ1 and RQ2) through a detailed examination of the collected and cleansed datasets.

4.1. Answering RQ1

To answer RQ1, we manually categorized the posts into six distinct categories based on the nature
of the issues discussed. To assess the reliability of the manual classification, we calculated the
inter-rater reliability using Cohen’s Kappa statistic. The Kappa score was 0.883, indicating
almost perfect agreement between the two authors. The categorization helped us to identify the
most common themes in the developer community’s conversations about LLaMA. Table 1 provides a
summary of the categories and the number of posts that relate to each category.

From our analysis, it is evident that the largest share of discussions focuses on Model
Configuration and Fine-Tuning, with 135 posts, making it the most frequently discussed topic. This
suggests that many developers are struggling with configuring and fine-tuning LLaMA models to meet
specific needs. Posts in this category often mention challenges such as adjusting hyperparameters,
loading pre-trained models,


         Figure 3: Example of Model Configuration and Fine-Tuning post (Post ID 76880690)


Table 1                                                              challenges, and difficulties in setting up dependencies. The
Categories of LLaMA Discussion on Stack Overflow                     high number of posts in this category indicates that getting
                                                                     started with LLaMA can be particularly challenging, espe-
 Category                                Number of Posts
                                                                     cially for users who are new to the model or unfamiliar with
 Model Configuration and Fine-Tuning                   135           the broader ecosystem of tools it integrates with. Figure
 Error Handling and Debugging                          110           5 shows a Stack Overflow post titled “Cuda 12.2 and issue
 Installation and Setup Issues                          91           with bitsandbytes package installation” categorized under
 Integration and API Usage                              86           “Installation and Setup Issues.” In this post, the developer is
 Runtime and Performance Issues                         73
                                                                     facing an issue with running Llama 2 on Google Colab and
 Model Deployment and Hosting                           24
                                                                     asks for help.
 Total                                                 519              Integration and API Usage, with 86 posts, reflects discus-
                                                                     sions on how to connect LLaMA with other systems, par-
                                                                     ticularly through APIs. Developers often seek guidance
and optimizing models for specific tasks or datasets. The            on integrating LLaMA into existing workflows, leveraging
prevalence of this category suggests that LLaMA’s flexibility        its capabilities alongside other tools, and addressing API-
and complexity in configuration require careful attention            related challenges. These discussions highlight the impor-
and often lead to challenges that developers seek to overcome.       tance of seamless integration between LLaMA and other
Figure 3 shows a                                                     technologies, as well as the need for clear guidelines on API
                                                                     usage.
                                                                        Runtime and Performance Issues, comprising 73 posts, fo-
                                                                     cuses on challenges that developers face during the execu-
                                                                     tion of LLaMA models. This includes discussions on optimiz-
                                                                     ing model performance, managing resource consumption,
Stack Overflow post titled “Chat with spreadsheet using              and addressing latency issues. Posts in this category often
Meta Llama (Llama 2 13B Chat HF),” categorized under the             highlight the need for efficient execution of LLaMA models,
Model Configuration and Fine-Tuning category. In this post,          especially in production environments where performance
the questioner is facing the problem of using LLaMA for              is critical.
querying spreadsheet data.                                              Model Deployment and Hosting, with 24 posts, is the least
   Error Handling and Debugging accounts for 110 posts.              discussed category. Posts here focus on deploying LLaMA
This category includes posts where developers encountered            models into production, managing model versions, and host-
errors during the use of LLaMA and sought solutions to re-           ing models on different platforms. The relatively low num-
solve these issues. Common topics in this category involve           ber of posts in this category might suggest that deployment
troubleshooting runtime errors, resolving compatibility is-          is a more advanced stage of working with LLaMA, which
sues with other libraries, and debugging scripts that fail to        fewer users have reached, or that deployment-related issues
execute as expected. The prevalence of posts in this cate-           are less frequent or already well-documented within the
gory underscores the need for robust debugging tools and             community.
clear documentation to help developers efficiently resolve              Overall, the distribution of posts across these categories
issues. Figure 4 depicts a Stack Overflow post titled “How           provides valuable insights into the areas where LLaMA users
to debug the Llama 2 inference command with VSCode,”                 are most likely to encounter challenges. It also highlights
which is categorized under “Error Handling and Debugging.”           the importance of comprehensive support and resources
In this post, the questioner asks about configuring Visual           in the areas of model configuration, error handling, and
Studio Code to debug the Llama 2 inference script.                   integration.
   Installation and Setup Issues is another prominent cate-
gory, comprising 91 posts. This category covers problems en-
countered during the initial stages of working with LLaMA,
including installation errors, environment configuration


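The inter-rater agreement reported for the manual classification can be illustrated with a minimal sketch of Cohen's Kappa for two raters. The function name, the category labels, and the two toy label lists below are hypothetical and are not the study's data. The source does not describe the tooling the authors used, so this shows only the standard formula, kappa = (p_o - p_e) / (1 - p_e).

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items that received the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement, derived from each rater's marginal label counts.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels for eight posts (NOT the study's data).
rater_1 = ["config", "config", "error", "install", "error", "deploy", "config", "install"]
rater_2 = ["config", "config", "error", "install", "config", "deploy", "config", "error"]

kappa = cohen_kappa(rater_1, rater_2)
print(f"Cohen's Kappa: {kappa:.3f}")
```

On the commonly used Landis and Koch scale, values above 0.80 are read as almost perfect agreement, which is how the reported score of 0.883 is interpreted.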
          Figure 4: Example of the Error Handling and Debugging post (Post ID 77421713)


          Figure 5: Example of Installation and Setup Issues post (Post ID 78194505)


4.2. Answering RQ2                                                    learning tools. LangChain, in particular, is a framework
                                                                      designed for building applications with LLMs, suggesting
To address RQ2, we examined the co-occurrence of tags
                                                                      that LLaMA users are developing complex workflows that
in posts discussing LLaMA. By analyzing these tags, we
                                                                      involve multiple LLMs.
aimed to identify related topics and technologies that are
                                                                         Notably, the openai-api tag appeared in 26 posts, indicat-
commonly mentioned alongside LLaMA on Stack Overflow.
                                                                      ing a significant interest in interoperability between LLaMA
Table 2 summarizes the frequency of the most common
                                                                      and OpenAI’s models. The posts in this category reveal
co-occurring tags.
                                                                      several common themes:
   The analysis revealed that the large-language-model tag
was the most frequently co-occurring tag with LLaMA, ap-                  1. Interoperability Between LLaMA and OpenAI Mod-
pearing in 201 posts. This suggests that discussions around                  els: Many posts discuss how to integrate or migrate
LLaMA are often framed within the broader context of large                   between LLaMA models and OpenAI APIs. For in-
language models, indicating that developers are considering                  stance, questions related to migrating from ChatGPT
LLaMA alongside other major models in this category. The                     to Llama 2 or using different LlamaIndex chat en-
frequent mention of python (184 posts) and huggingface-                      gine modes with an OpenAI key suggest that users
transformers (109 posts) indicates that developers are ac-                   are exploring how to use both systems together or
tively using Python-based tools and libraries, particularly                  comparing their functionalities.
Hugging Face’s Transformers library, to work with LLaMA.                  2. LangChain and LLaMA: Several posts mention
This reflects LLaMA’s integration into the Python ecosys-                    LangChain in conjunction with LLaMA. LangChain
tem and its compatibility with popular machine-learning                      is a framework for building applications with LLMs,
frameworks.                                                                  and the discussions around using it with LLaMA sug-
   The co-occurrence of tags like langchain (77 posts) and py-               gest that users are working on sophisticated work-
torch (70 posts) further supports the observation that LLaMA                 flows involving multiple language models. This high-
is frequently used in conjunction with other machine-                        lights LLaMA’s role in the broader landscape of lan-


Table 2                                                              LLaMA AI. This assumption may have resulted in the in-
Top Co-Occurring Tags with llama Tag on Stack Overflow               clusion of irrelevant or off-topic content. We mitigated this
                                                                     risk by performing a manual verification of 500 posts to en-
         Tag                         Occurrences
                                                                     sure relevance, though some less obvious irrelevant content
         llama                                 398                   might still remain. Additionally, our reliance on manual clas-
         large-language-model                  201                   sification introduces the risk of human error and bias. To
         python                                184                   address this, two authors independently classified the posts,
         huggingface-transformers              109                   and any discrepancies were resolved through discussion
         langchain                              77
                                                                     to increase consistency and reduce subjectivity. However,
         pytorch                                70
         huggingface                            66                   biases inherent in manual processes may still exist, and the
         llama-index                            40                   absence of automated classification tools may have limited
         nlp                                    39                   the scalability of the analysis.
         artificial-intelligence                34                      Furthermore, the data collection was conducted only up
         fine-tuning                            28                   until July 22, 2024, which excludes newer posts. As the field
         openai-api                             26                   of large language models (LLMs) evolves rapidly, this limi-
         machine-learning                       25                   tation may have prevented us from capturing recent trends
         ollama                                 23                   or emerging challenges, potentially affecting the complete-
         llamacpp                               22                   ness and timeliness of our analysis. External Validity: The
         llama-cpp-python                       22
                                                                     findings are based solely on Stack Overflow (SO) posts with
         llama3                                 20
         python-3.x                             20
                                                                     the keyword “LLaMA” in the titles or tags, which may limit
         amazon-sagemaker                       19                   the generalizability of our results to other technical Q&A
         chatbot                                14                   platforms such as GitHub, Reddit, or specialized forums
         gpu                                    14                   where different types of discussions and more complex tech-
         amazon-web-services                    12                   nical issues may be addressed. By focusing exclusively on
         Others                                356                   SO, we may have missed richer, more nuanced developer
         Total                               1,819                   challenges that could provide a broader understanding of
                                                                     LLaMA adoption across different communities.

       guage model applications.                                     5. Implications
    3. Fine-Tuning and Model Performance: With the fine-
       tuning tag appearing in 28 posts, this category re-           The findings from this study provide valuable insights into
       flects discussions around optimizing LLaMA models.            the quality of Meta’s LLaMA model and how the developer
       Posts such as “LLaMA Index training my own model              community engages with it on Stack Overflow, particularly
       gives poor results” indicate challenges in fine-tuning        in terms of overcoming technical challenges. The analysis
       LLaMA models. Users seek advice on improving                  reveals that discussions predominantly focus on issues such
       model performance, particularly in fine-tuning and            as configuring, fine-tuning, and integrating LLaMA into
       optimizing models for specific tasks.                         various applications. This highlights the model’s flexibility
    4. Vector Databases and RetrievalQA: Discussions in              but also points to its complexity, underscoring the need for
       this area involve using LLaMA models with vector              improved documentation, resources, and tools.
       databases and RetrievalQA. Users are focusing on                 One key implication is the necessity for enhanced com-
       effectively retrieving documents or managing stor-            munity support and resources for model configuration and
       age when integrating LLaMA with OpenAI’s API,                 fine-tuning. The frequency of posts on these topics suggests
       reflecting the complexity of tasks users are under-           that many developers, especially those without advanced
       taking.                                                       expertise in machine learning, encounter significant diffi-
    5. Computational Resources: Questions related to hard-           culties. By improving documentation and offering more
       ware usage, such as issues with running LLaMA                 user-friendly tools, Meta could lower the barrier to entry
       models on CPUs or optimizing GPU usage, high-                 for a wider audience, leading to broader adoption of LLaMA.
       light developers’ concerns about the computational            This could also include the development of community-
       demands of LLaMA models. Tags such as gpu and                 driven forums, FAQs, or official support channels dedicated
       amazon-sagemaker appear alongside discussions fo-             to troubleshooting configuration and fine-tuning issues.
       cused on resource optimization.                                  Another important implication is the need to prioritize
                                                                     seamless integration with existing machine-learning ecosys-
   These findings illustrate that LLaMA is part of a larger          tems. The co-occurrence analysis shows that LLaMA is fre-
ecosystem of tools and technologies, with significant inter-         quently used in conjunction with popular frameworks like
est in how it can be integrated with or compared to other            Hugging Face’s Transformers, PyTorch, and LangChain, par-
models, particularly those from OpenAI. The discussions              ticularly in Python environments. This suggests that future
also underscore the importance of effective model manage-            iterations of LLaMA should focus on making integration
ment, performance optimization, and resource utilization             with these frameworks more straightforward and efficient,
when working with LLaMA.                                             potentially through more robust APIs, pre-built connectors,
                                                                     or better interoperability guidelines. Ensuring compatibility
4.3. Threats to Validity                                             with widely-used tools will be crucial in positioning LLaMA
                                                                     as a go-to solution for developers working on real-world
Several threats to validity may impact the findings of this
                                                                     applications. Finally, the relatively low number of posts
study. Internal Validity: One potential threat is the assump-
                                                                     discussing the deployment and hosting of LLaMA models
tion that all posts in the dataset were relevant to Meta’s


suggests that this is still an emerging area. However, as             7. ACKNOWLEDGEMENT
more developers move toward deploying LLaMA models in
production environments, there will likely be an increasing           This research project was supported by the Faculty of ICT,
demand for comprehensive deployment tools, best practices,            Mahidol University.
and infrastructure support.
                                                                      References
6. Conclusion and Future Work
                                                                       [1] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A.
In conclusion, this preliminary study provides a detailed                  Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro,
investigation of the quality, the challenges, and related top-             F. Azhar, et al., Llama: Open and efficient foundation
ics in the Stack Overflow community’s discussions about                    language models, arXiv preprint arXiv:2302.13971
LLaMA. By understanding these areas, Meta and the broader                  (2023).
developer community can better support the use of LLaMA,               [2] S. Sadeghi, A. Bui, A. Forooghi, J. Lu, A. Ngom, Can
ultimately driving innovation in LLM development.                          large language models understand molecules?,
   This study provides valuable insights into the challenges               2024. URL: https://arxiv.org/abs/2402.00024.
developers face when adopting LLaMA, based on Stack Over-                  arXiv:2402.00024.
flow discussions. However, several areas for future research           [3] K. Wangsa, S. Karim, E. Gide, M. Elkhodr, A systematic
could significantly enrich the findings and address the limi-              review and comprehensive analysis of pioneering ai
tations identified in this study. First, expanding the dataset             chatbot models from education to healthcare: Chat-
to include posts beyond July 2024 will help capture evolving               gpt, bard, llama, ernie and grok, Future Internet 16
trends as LLaMA and other large language models (LLMs)                     (2024). URL: https://www.mdpi.com/1999-5903/16/7/219.
continue to develop. Additionally, incorporating data from                 doi:10.3390/fi16070219.
other platforms such as GitHub Issues, Reddit, and developer           [4] J. Son, B. Kim, Trend Analysis of Large Language
forums could provide a broader perspective on LLaMA’s us-                  Models through a Developer Community: A Focus on
age, especially on more complex technical problems and                     Stack Overflow, Information 14 (2023).
nuanced discussions that may not be captured on Stack                  [5] A. Hörnemalm, O. Norberg, T. Mejtoft, ChatGPT as a
Overflow alone. Comparing LLaMA to other LLMs, such as                     Software Development Tool The Future of Develop-
ChatGPT or Claude, would also provide valuable insights,                   ment, Master’s thesis, Umeå University, Department
allowing researchers to understand LLaMA’s challenges in                   of Applied Physics and Electronics, 2023.
the broader landscape and better justify its focus.                    [6] L. Zhong, Z. Wang, Can llm replace stack overflow? a
   Furthermore, future research should enhance the method-                 study on robustness and reliability of large language
ology by employing a more rigorous approach to data fil-                   model code generation, in: Proceedings of the AAAI
tering and analysis. Pre-processing the data to exclude triv-              Conference on Artificial Intelligence, volume 38, 2024,
ial questions and focusing on more substantial challenges                  pp. 21841–21849.
would yield more meaningful insights. Using established                [7] K. Jin, C.-Y. Wang, H. V. Pham, H. Hemmati, Can
qualitative coding frameworks for topic classification would               ChatGPT Support Developers? An Empirical Evalua-
further improve the transparency and validity of the analy-                tion of Large Language Models for Code Generation,
sis. Another promising direction is incorporating sentiment                in: Proceedings of the 21st International Conference
analysis to understand community attitudes toward LLaMA.                   on Mining Software Repositories, MSR ’24, 2024, pp.
By analyzing the tone of discussions across platforms, re-                 167–171.
searchers could uncover whether developers’ experiences                [8] A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sen-
with LLaMA are generally positive, negative, or neutral, of-               gupta, S. Yoo, J. M. Zhang, Large Language Models for
fering Meta and the developer community actionable feed-                   Software Engineering: Survey and Open Problems, in:
back for improving the tool.                                               ICSE-FoSE’23, 2023, pp. 31–53.
   Additionally, complementing the analysis with user stud-            [9] A. Buscemi, D. Proverbio, Chatgpt vs gemini vs llama
ies, such as surveys or interviews, could provide a deeper                 on multilingual sentiment analysis, 2024. URL:
understanding of the practical challenges faced by develop-                https://arxiv.org/abs/2402.01715. arXiv:2402.01715.
ers using LLaMA in real-world scenarios. Exploring specific           [10] A. Salinas, P. Shah, Y. Huang, R. McCormack,
use cases where LLaMA is integrated into different applica-                F. Morstatter, The unequal opportunities of large
tion domains, such as natural language processing (NLP) or                 language models: Examining demographic biases
enterprise applications, could reveal unique challenges and                in job recommendations by chatgpt and llama, in:
benefits in various contexts. Finally, investigating advanced              Proceedings of the 3rd ACM Conference on Eq-
stages of LLaMA adoption, particularly in production envi-                 uity and Access in Algorithms, Mechanisms, and
ronments, would help identify issues related to deployment                 Optimization, EAAMO ’23, Association for Com-
and model hosting, offering a more complete picture of                     puting Machinery, New York, NY, USA, 2023.
LLaMA’s practical applications and limitations. By address-                URL: https://doi.org/10.1145/3617694.3623257.
ing these areas, future research will contribute to a more                 doi:10.1145/3617694.3623257.
comprehensive understanding of LLaMA’s role within the                [11] L. Da Silva, J. Samhi, F. Khomh, Chatgpt vs llama:
LLM ecosystem, driving more effective support for develop-                 Impact, reliability, and challenges in stack overflow
ers and fostering broader adoption of open-source LLMs.                    discussions, arXiv preprint arXiv:2402.08801 (2024).



</pre>