=Paper=
{{Paper
|id=Vol-3762/489
|storemode=property
|title=GitHub Copilot: a systematic study
|pdfUrl=https://ceur-ws.org/Vol-3762/489.pdf
|volume=Vol-3762
|authors=Alessandro Benetti,Michele Filannino
|dblpUrl=https://dblp.org/rec/conf/ital-ia/BenettiF24
}}
==GitHub Copilot: a systematic study==
<pdf width="1500px">https://ceur-ws.org/Vol-3762/489.pdf</pdf>
<pre>
                                GitHub Copilot: a systematic study
                                Alessandro Benetti¹,∗, Michele Filannino¹
                                ¹ Prometeia, Piazza Trento e Trieste, 30 - Bologna, 40137, Italy


                                                   Abstract
                                                   This paper examines the effects of GitHub Copilot, a prominent example of generative artificial intelligence (GAI), on software
                                                   development methodologies. Through an empirical study of GitHub Copilot’s performance in a professional setting, we assess
                                                   its value across various programming environments. Our comprehensive evaluation reveals that GitHub Copilot significantly
                                                   improves developer productivity and assistance in different coding scenarios. Furthermore, the research outlines effective
                                                   strategies for leveraging GitHub Copilot to its fullest potential, thus advancing the use of GAI tools in software engineering.
                                                   While recognizing GitHub Copilot’s considerable advantages, we also identify its shortcomings and areas in need of further
                                                   improvement.

                                                   Keywords
                                                   Generative AI, Software Engineering, GitHub, Software Development, Systematic Study, GAI, Coding Assistance


                                1. Introduction                                                                                             is part of the advanced Generative Pre-trained Trans-
                                                                                                                                            former series. This model leverages the Transformer
                                The advent of Generative Artificial Intelligence (GAI) is                                                   architecture [10], renowned for its efficacy in processing
                                transforming our approach to creativity and the produc-                                                     and generating text that closely resembles human writing.
                                tion of new content. GAI encompasses machine learning                                                       GPT-3.5’s capabilities extend to a comprehensive under-
                                algorithms capable of generating content—ranging from                                                       standing of language subtleties, contextual nuances, and
                                images, videos, and text to music—that mirrors the style                                                    notably, programming code syntax.
                                and quality of human-created works.                                                                            GitHub Copilot’s primary objective is to enable pro-
                                   Recent breakthroughs in deep learning have given rise                                                    grammers to concentrate on problem-solving instead of
                                to sophisticated GAI models, such as latent diffusion mod-                                                  searching for the appropriate libraries and functions to
                                els [1] and Generative Pre-trained Transformers (GPT)                                                       implement their desired solution. With this tool, pro-
                                [2]. These models, capable of producing realistic and var-                                                  grammers can unlock a new level of productivity and
                                ied content with minimal human oversight, are trained                                                       efficiency and deliver high-quality code in a fraction of
                                on extensive datasets and generate new items by sam-                                                        the time that it typically takes.
                                pling from a learned probability distribution.                                                                 The primary capabilities of GitHub Copilot can be
                                   GAI’s potential is vast, with applications including the                                                 summarized as follows:
                                creation of lifelike virtual imagery (e.g., DALLE-3 [3],
                                Midjourney [4], Stable Diffusion [5]), serving as efficient                                                      • Natural language interface: Developers can inter-
                                writing assistants or conversational agents (e.g., ChatGPT                                                         act with GitHub Copilot using natural language
                                [6], LLAMA [7], Gemini [8], Claude [9]). However, the                                                              commands. This means they can describe what
                                rapid adoption of GAI technologies necessitates careful                                                            they want to achieve in plain English, and GitHub
                                consideration of their ethical and responsible use, partic-                                                        Copilot will suggest code to accomplish the task.
                                ularly in light of significant ethical and legal challenges                                                      • Integrated with IDEs: GitHub Copilot is inte-
                                such as intellectual property rights, privacy issues, and                                                          grated with popular code editors and IDEs, in-
                                the potential for misuse of GAI-generated content.                                                                 cluding Visual Studio Code, Visual Studio MSDN,
                                                                                                                                                   and PyCharm.
                                                                                                                                                 • Context-aware: GitHub Copilot analyzes the con-
                                2. GitHub Copilot                                                                                                  text of the code being written and generates sug-
                                                                                                                                                   gestions accordingly.
                                GitHub Copilot distinguishes itself as an innovative appli-                                                      • Privacy-focused: GitHub Copilot for Business
                                cation of GAI, offering substantial assistance to develop-                                                         does not retain telemetry or code snippets data.
                                ers in coding tasks. It is based on a GPT-3.5 model, which
                                Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
                                ∗ Corresponding author.
                                Email: alessandro.benetti@prometeia.com (A. Benetti); michele.filannino@prometeia.com (M. Filannino)
                                © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

                                                                                                                                               While GitHub Copilot can be a powerful tool for developers,
                                                                                                                                            it is important to underline some of the potential concerns
                                                                                                                                            that are also somewhat common to most Large Language Models:

                                                                                                                                                 • Potentially inaccurate code: one potential concern
                                                                                                                                                   with GitHub Copilot is that it may generate
       incorrect or non-functional code.                        tracked in the previous ones. Therefore, we decided
     • Limited world or codebase knowledge after the            to adopt the SPACE framework (Forsgren, 2021 [13]),
       training date: this might cause the suggestion           which focuses on various aspects of developer productiv-
       of deprecated methods for libraries that change          ity, ranging from overall individual satisfaction to knowl-
       significantly over time.                                 edge sharing among different individuals. A summary of
     • No match with the information that the program-          these questions can be found at the following url.
       mer has: this is true for both the overall context of
       the code that it is suggesting, and some intrinsic       3.1. Participants Selection
       knowledge about the world that the programmer
       has, like awareness among other things.                  For this study, we selected 31 participants from three
                                                                specialized branches within our company, in particular:
  To address these concerns, it is important for devel-
opers to carefully review and test the code generated by             • Branch A, a development team of a long-
GitHub Copilot.                                                        standing software solution, working on both new
                                                                       features and the maintenance of pre-existing
                                                                       ones.
3. Methodology                                                       • Branch B, focused mostly on the development
                                                                       of a new software product.
With the advent of this groundbreaking technology, it                • Branch C, the development team of a software
is crucial to thoroughly evaluate its potential through                cloud product, engaged with both development
extensive testing. At Prometeia, a software development-               of new features and maintenance.
focused consulting firm, we’ve decided to embark on a
pilot study aimed explicitly at evaluating the functionali-        These participants were selected due to their involve-
ties of GitHub Copilot.                                         ment in a broad range of projects, encompassing both
   We chose GitHub Copilot Business over alternatives           innovative and established (legacy) projects. To promote
like Tabnine, Blackbox, and Sourcery due to its wide            an unbiased evaluation, we refrained from assigning pre-
range of supported programming languages, compatibil-           determined tasks, allowing participants to incorporate
ity with various Integrated Development Environments            the tool into their regular workflow. Over a two-month
(IDEs), and advanced features that meet enterprise stan-        observation period, we monitored their usage of GitHub
dards, including scalability, security, and compliance.         Copilot, aiming to capture its utility across diverse project
   Various reviews and studies exist, including those by        types and user experiences. Notably, all of the selected
Vaithilingam [11] and the GitHub Copilot study [12]. How-       participants had at least 1 year of programming
ever, these investigations have occasionally encountered        experience.
contradictory findings and have not specifically concen-           In order to gather participant feedback, we organized
trated on the implementation of this tool within a real-        a series of in-person meetings, offering a forum for them
world corporate environment.                                    to share their experiences with GitHub Copilot. Based on
   Undertaking a pilot study offers numerous advantages,        the insights gained during these discussions, we crafted
making it a strategic approach for our evaluation process.      a 16-question survey covering the SPACE framework
Firstly, the pilot allows our developers to assess the tool’s   dimensions (available at the following url). This survey
effectiveness by testing it on a small scale. This provides     mixed questions from existing research with new ones
an opportunity to gauge how well GitHub Copilot can             specifically designed for this study, including both closed
assist in achieving their objectives, determine if the gen-     and open-ended questions. The closed-ended questions
erated code meets their requirements, and assess if it          aimed to collect quantitative data, while the open-ended
improves their current development process. Secondly,           ones sought to capture more nuanced feedback on their
a pilot study can assist us in identifying any limitations      experiences. This approach aimed to collect quantitative
or potential issues with GitHub Copilot, such as difficul-      and qualitative data to comprehensively evaluate GitHub
ties with specific programming languages or complex             Copilot’s performance.
coding tasks. By identifying such limitations early on,
we can avoid potential problems and find alternatives
or workarounds to using the tool, thus saving time and          4. Results & Discussion
money in the long run. Hence, this initiative aimed to
determine whether GitHub Copilot would be a viable              The following section presents some of the key findings
addition to our software development toolkit.                   obtained from the analysis of participant responses. This
   To compare our findings with other studies, we in-           section will be divided into three parts:
cluded most of the key performance indicators (KPIs)
     • overall ratings: Participants were asked to rate         suggestions by AI-based tools like GitHub Copilot. Addi-
       GitHub Copilot on a scale from 1 to 10, with 10          tionally, the extensive availability of open-source code in
       being the highest score. This rating serves as an        these languages may provide a richer dataset for the AI’s
       overall assessment of GitHub Copilot’s perfor-           learning algorithms, enhancing its predictive accuracy
       mance and effectiveness.                                 and relevance. Conversely, the analysis reveals a mod-
     • main benefits: This section highlights the areas         est decline in satisfaction among C# developers, with
       where GitHub Copilot excels. It examines the             a mean score of 7. This discrepancy hints at possible
       specific aspects or functionalities of the tool that     limitations in GitHub Copilot’s adaptability or efficiency
       participants found most valuable or beneficial in        across different programming environments. The factors
       their programming tasks.                                 contributing to this variation could range from the struc-
     • main drawbacks: In this part we explore the              tural and syntactical idiosyncrasies of C# that challenge
       challenges or limitations experienced by partic-         the AI’s prediction models, to a potentially lesser volume
       ipants when using GitHub Copilot. It focuses             of training data derived from C# codebases.
       on the areas where the tool may struggle or en-             These insights advocate for a more nuanced approach
       counter difficulties, as reported by the partici-        to the continuous development and refinement of GitHub
       pants.                                                   Copilot, emphasizing the need for language-specific op-
                                                                timizations to cater to the diverse requirements of the
  In the following sections, we will delve deeper into          development community. For users, the findings high-
these areas to comprehensively analyse the findings.            light the importance of aligning expectations with the
                                                                capabilities and limitations of AI tools within specific
                                                                programming contexts.

                                                                4.2. Main benefits
                                                                The study’s findings, as visualized in Figure 2, delineate
                                                                the multifaceted benefits that GitHub Copilot offers to
                                                                developers, underscoring its impact on productivity and
                                                                code quality.
                                                                   One of the principal advantages identified by partic-
                                                                ipants is Copilot’s proficiency in auto-generating boil-
                                                                erplate code and foundational code structures. This
                                                                feature significantly reduces the time and effort required
                                                                during the initial phases of project setup, allowing de-
Figure 1: Overall rating given to GitHub Copilot, by program-   velopers to bypass the tedium of crafting repetitive code
ming language                                                   patterns from scratch. Such efficiency in establishing
                                                                project infrastructure is not only a time-saver but also
                                                                enables a smoother transition to more complex develop-
                                                                ment tasks.
4.1. Overall feedback                                              Moreover, GitHub Copilot’s contribution to code doc-
The evaluation of GitHub Copilot’s efficacy, as illustrated     umentation represents another vital benefit. The tool’s
in Figure 1, reveals an overall positive reception, with        ability to furnish quick and precise descriptions for func-
a computed mean rating of 7.4 on a scale where higher           tions, classes, and various code segments assists develop-
values denote greater approval. This overarching as-            ers in maintaining well-documented codebases. Proper
sessment, however, masks underlying variations in user          documentation is crucial for enhancing code readability,
satisfaction that are closely linked to the specific pro-       facilitating easier maintenance, and enabling smoother
gramming language in use by the developers.                     collaboration among team members. By automating this
   A more granular analysis of the data shows that de-          aspect, Copilot aids in ensuring that projects adhere to
velopers employing high-level programming languages,            best practices in code documentation, thus elevating the
notably Python and Java, tend to assign higher ratings          overall quality of the development process.
to GitHub Copilot. This distinction suggests a poten-              The generation of test code for existing functions by
tial correlation between the nature of the programming          GitHub Copilot is highlighted as a particularly advan-
language and the tool’s performance. High-level lan-            tageous feature. This capability assists developers in
guages, characterized by their abstraction from machine         creating comprehensive test suites, a critical component
languages and emphasis on readability, may inherently           of the software development life cycle aimed at verifying
facilitate more accurate and contextually relevant code         the correctness and reliability of code. Notably, we ad-
Figure 2: GitHub Copilot usefulness on different tasks


vised the developers to exercise increased caution when
incorporating tests authored by GitHub Copilot, given
their significant influence on the code’s overall reliability.
   An interesting observation from the study is the strate-
gic utilization of time saved through GitHub Copilot’s
assistance. Many participants reported reallocating the
time gained to enhance the quality of their products
further by focusing on rigorous testing, refining doc-
umentation, or dedicating effort to areas of the project
that could benefit from manual oversight. Alternatively,
some participants chose to invest the saved time into per-
sonal development, such as exploring new programming
libraries, learning new tools, or contributing to other
projects. This flexibility underscores Copilot’s role not Figure 3: Percentage of code lines written by GitHub Copilot
just as a tool for immediate productivity gains but also as
an enabler for broader professional growth and product
quality enhancement.
                                                               dency to suggest repetitive code. Such suggestions
                                                               can potentially lead to less efficient or elegant coding
4.3. Main drawbacks                                            solutions, contradicting the tool’s aim to streamline de-
While GitHub Copilot has been lauded for its ability to en-    velopment   efforts. This behavior might stem from the
hance developer productivity and streamline workflows,         AI’s training data or its current understanding of best cod-
the tool is not without its limitations, which can impact      ing  practices, indicating an area for further refinement
its overall effectiveness in certain contexts.                 to ensure that Copilot consistently proposes high-quality
   One notable concern is its integration with Integrated and contextually appropriate code.
Development Environments (IDEs), particularly when                The effectiveness of GitHub Copilot appears to vary
used alongside other coding aids such as Intellisense. significantly when dealing with different types of code-
Some users have reported conflicts between GitHub bases. Specifically, its performance with large or legacy
Copilot and Intellisense, leading to potential confu- codebases presents challenges, as evidenced by a reported
sion and errors. This issue underscores the importance median contribution of merely 10% (Figure 3) to the lines
of seamless tool integration within the development en- of code written in such contexts. This reduced effec-
vironment to prevent disruption in the coding process. tiveness could be attributed to the AI’s limited ability to
   Another drawback observed by users is the tool’s ten- fully comprehend the complexities and nuances of older
or more extensive codebases, leading to challenges in          5.1. Comparison with other studies
generating accurate and useful code suggestions.
                                                               To further understand the impact of GitHub Copilot, we
   Conversely, GitHub Copilot demonstrates consider-
                                                               incorporated a key performance indicator (KPI) from a
ably greater efficiency with new codebases, where it con-
                                                               GitHub Copilot survey [12] for a direct comparison with
tributes to around 30% of the written code. This discrep-
                                                               our study’s outcomes. Our findings, reported in Table
ancy highlights Copilot’s aptitude for aiding in the rapid
                                                               1, showed notable differences from GitHub’s reported
development of new projects, where its capabilities in
                                                               results. Although our results still reflect a highly positive
generating boilerplate code and structuring new projects
                                                               sentiment, it is important to note that the differences may
can be most beneficial.
                                                               be attributed to the nature of the experiments conducted
                                                               by GitHub. Specifically, our study population consisted
5. Related works                                               of individuals with impending deadlines, which could
                                                               influence their perceptions and experiences with the tool.
The assessment of Artificial Intelligence (AI) tools’ impact
on various sectors, particularly in software development,                                          GitHub    Prometeia
has been an area of interest in recent years. The advent        Question                           Overall   Overall   Branch A   Branch B   Branch C
of AI innovations has been consistently associated with         Focus on more satisfying work      74%       35%       36%        44%        27%
enhancements in productivity levels and the facilitation        Feel more productive               88%       32%       18%        44%        36%
of a more intuitive process for coding, as documented in        Are faster with repetitive tasks   96%       74%       82%        78%        63%
the findings of Chen et al. (2021) [14], who illustrated
the positive ramifications of AI on software engineering        Table 1
practices.                                                      GitHub study performance metrics
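
The differences reported in Table 1 can be made explicit with a little arithmetic. The following is a minimal sketch, not part of the original study: it transcribes the percentages from Table 1 and computes, for each question, the gap between the GitHub result and the overall Prometeia result, plus the spread between the three branches, all in percentage points. The names TABLE_1, BRANCHES, summarize, gap_vs_github and branch_spread are hypothetical and chosen only for illustration.

```python
# Percentages transcribed from Table 1 (share of respondents agreeing).
TABLE_1 = {
    "Focus on more satisfying work": {
        "GitHub": 74, "Prometeia": 35,
        "Branch A": 36, "Branch B": 44, "Branch C": 27,
    },
    "Feel more productive": {
        "GitHub": 88, "Prometeia": 32,
        "Branch A": 18, "Branch B": 44, "Branch C": 36,
    },
    "Are faster with repetitive tasks": {
        "GitHub": 96, "Prometeia": 74,
        "Branch A": 82, "Branch B": 78, "Branch C": 63,
    },
}

# Hypothetical helper names; the branch labels are the ones used in Table 1.
BRANCHES = ("Branch A", "Branch B", "Branch C")


def summarize(table):
    """Return per-question differences, expressed in percentage points."""
    summary = {}
    for question, row in table.items():
        branch_values = [row[branch] for branch in BRANCHES]
        summary[question] = {
            # GitHub overall result minus Prometeia overall result.
            "gap_vs_github": row["GitHub"] - row["Prometeia"],
            # Highest branch value minus lowest branch value.
            "branch_spread": max(branch_values) - min(branch_values),
        }
    return summary


SUMMARY = summarize(TABLE_1)

if __name__ == "__main__":
    for question, stats in SUMMARY.items():
        print(
            f"{question}: gap {stats['gap_vs_github']} pp, "
            f"branch spread {stats['branch_spread']} pp"
        )
```

Under these assumptions the gap with the GitHub study is 39, 56 and 22 percentage points for the three questions, and the spread between branches is 17, 26 and 19 percentage points. The widest gap and the widest spread both fall on "Feel more productive".
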
   In the realm of code generation, the deployment of
deep learning methodologies, especially those utilizing           The variations in percentages observed between the
Transformer models, has been met with considerable suc-        groups can be ascribed to various factors, such as the pro-
cess. A noteworthy illustration of this is the study by        gramming languages employed by developers and the
Feng et al. (2020) [15], which presents a model that sig-      nature of the projects they were engaged in. For exam-
nificantly surpasses the efficiency of conventional code       ple, diverse branches may adopt distinct programming
completion tools. This approach, which harnesses the           practices, preferences, and project requirements, which
power of deep learning to achieve a contextual compre-         can shape their views and usage of GitHub Copilot. Ad-
hension and prediction, showcases the potential of gener-      ditionally, the nature of the projects (whether new or
ative AI to navigate and replicate complex coding idioms       legacy) can also impact how the benefits and drawbacks
and patterns with remarkable accuracy.                         of GitHub Copilot are perceived. These factors highlight
   Furthermore, research conducted by Yetistiren et al.        the significance of considering the context in which the
(2022) [16] denotes the proficiency of GitHub Copilot          tool is employed and comprehending its potential influ-
in understanding coding syntax. This underlines the            ence on the recorded percentages.
broad spectrum of advantages offered by AI in the realm
of software development, extending beyond mere pro-
cedural improvements to include significant qualitative        6. Conclusions
enhancements in code management and optimization.
   Despite the proliferation of studies and reviews in this    The study revealed a generally positive overall rating
area, it is crucial to acknowledge the predominance of         for GitHub Copilot, with developers giving it an average
in vitro research methodologies (where specific program-       score of 7.4 (out of 10) and, as expected, ratings varied
ming tasks are assigned to participants) and the occa-         based on the programming language used. Our develop-
sional emergence of conflicting findings, as highlighted       ers identified several benefits of using GitHub Copilot.
by Vaithilingam (2022) [11] and the study on GitHub            One of its main advantages is the ability to generate boil-
Copilot (2021) [12]. These discrepancies underscore the        erplate and basic code structures, saving developers time
necessity for our own comprehensive in vivo study (in          and effort during project setup. Additionally, the tool
which no specific programming tasks are prescribed), in-       ensures proper code documentation by providing accu-
volving a diverse array of developers from various corpo-      rate descriptions of functions, classes, and other code
rate sectors, employing different Integrated Development       elements. Another notable benefit is its capability to
Environments (IDEs), and programming languages.                generate test code for existing functions, contributing to
                                                               code reliability. Consistent with its original goal, this
                                                               study proved GitHub Copilot effective in supporting our
                                                               software development activities. As a result, various
                                                               branches of our company started using the tool as part of
their standard development toolkit. Lastly, we could not [10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
evaluate the tool on junior programmers, which leaves             L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, At-
an area of inquiry for future studies. Understanding how          tention is all you need, 2023. arXiv:1706.03762 .
newcomers to the field, with potentially different learn- [11] P. Vaithilingam, T. Zhang, E. L. Glassman, Ex-
ing curves and development practices, interact with and           pectation vs. experience: Evaluating the usabil-
benefit from GitHub Copilot could provide valuable in-            ity of code generation tools powered by large
sights into its overall utility and areas for improvement.        language models,         CHI EA ’22, Association
                                                                  for Computing Machinery, New York, NY, USA,
                                                                  2022. URL: https://doi.org/10.1145/3491101.3519665.
Acknowledgments                                                   doi:10.1145/3491101.3519665 .
                                                             [12] Research: Quantifying GitHub Copilot’s impact on
The authors wish to express their sincere gratitude to
                                                                  developer productivity and happiness, 2022. Ac-
Dave Burnison from GitHub for his feedback on the Copi-
                                                                  cessed: 2024-03-13. GitHub Copilot Study.
lot service. His insights and expertise have significantly
                                                             [13] N. Forsgren, M.-A. Storey, C. Maddila, T. Zim-
contributed to the research and understanding of the
                                                                  mermann, B. Houck, J. Butler, The SPACE of
impact of AI-assisted programming tools in software de-
                                                                  developer productivity: There’s more to it than
velopment.
                                                                  you think., Queue 19 (2021) 20–48. doi:10.1145/
                                                                  3454122.3454124 .
References                                                   [14] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P.
                                                                  de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda,
 [1] R. Rombach, A. Blattmann, D. Lorenz, P. Esser,               N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger,
      B. Ommer, High-resolution image synthesis with              M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan,
      latent diffusion models, 2022. arXiv:2112.10752 .           S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser,
 [2] T. B. Brown, B. Mann, N. Ryder, M. Subbiah,                  M. Bavarian, C. Winter, P. Tillet, F. P. Such,
      J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,           D. Cummings, M. Plappert, F. Chantzis, E. Barnes,
      G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss,          A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino,
      G. Krueger, T. Henighan, R. Child, A. Ramesh,               N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain,
      D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen,         W. Saunders, C. Hesse, A. N. Carr, J. Leike,
      E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark,          J. Achiam, V. Misra, E. Morikawa, A. Radford,
      C. Berner, S. McCandlish, A. Radford, I. Sutskever,         M. Knight, M. Brundage, M. Murati, K. Mayer,
      D. Amodei, Language models are few-shot learners,           P. Welinder, B. McGrew, D. Amodei, S. McCandlish,
      2020. arXiv:2005.14165 .                                    I. Sutskever, W. Zaremba, Evaluating large language
 [3] J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li,       models trained on code, 2021. arXiv:2107.03374 .
      L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al., Improv- [15] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong,
      ing image generation with better captions, Com-             L. Shou, B. Qin, T. Liu, D. Jiang, M. Zhou, Codebert:
      puter Science. https://cdn. openai. com/papers/dall-        A pre-trained model for programming and natural
      e-3. pdf 2 (2023) 8.                                        languages, 2020. arXiv:2002.08155 .
 [4] Midjourney, https://www.midjourney.com/home, [16] B. Yetistiren, I. Ozsoy, E. Tuzun, Assessing the
      2024. Accessed: March 25, 2024.                             quality of GitHub Copilot’s code generation, in:
 [5] P. Esser, S. Kulal, A. Blattmann, R. Entezari,               Proceedings of the 18th international conference
      J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer,          on predictive models and data analytics in software
      F. Boesel, et al., Scaling rectified flow transformers      engineering, 2022, pp. 62–71.
      for high-resolution image synthesis, arXiv preprint
      arXiv:2403.03206 (2024).
 [6] Chatgpt, https://openai.com/blog/chatgpt, 2024. Ac-
      cessed: April 4, 2024.
 [7] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A.
      Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Ham-
      bro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave,
      G. Lample, Llama: Open and efficient foundation
      language models, 2023. arXiv:2302.13971 .
 [8] Gemini-Team, Gemini: A family of highly capable
      multimodal models, 2024. arXiv:2312.11805 .
 [9] Claude, https://www.anthropic.com/claude, 2024.
      Accessed: April 4, 2024.

</pre>