=Paper=
{{Paper
|id=Vol-3762/489
|storemode=property
|title=GitHub Copilot: a systematic study
|pdfUrl=https://ceur-ws.org/Vol-3762/489.pdf
|volume=Vol-3762
|authors=Alessandro Benetti,Michele Filannino
|dblpUrl=https://dblp.org/rec/conf/ital-ia/BenettiF24
}}
==GitHub Copilot: a systematic study==
Alessandro Benetti¹,∗, Michele Filannino¹
¹ Prometeia, Piazza Trento e Trieste, 30 - Bologna, 40137, Italy
Abstract
This paper examines the effects of GitHub Copilot, a prominent example of generative artificial intelligence (GAI), on software
development methodologies. Through an empirical study of GitHub Copilot’s performance in a professional setting, we assess
its value across various programming environments. Our comprehensive evaluation reveals that GitHub Copilot significantly
improves developer productivity and assistance in different coding scenarios. Furthermore, the research outlines effective
strategies for leveraging GitHub Copilot to its fullest potential, thus advancing the use of GAI tools in software engineering.
While recognizing GitHub Copilot’s considerable advantages, we also identify its shortcomings and areas in need of further
improvement.
Keywords
Generative AI, Software Engineering, GitHub, Software Development, Systematic Study, GAI, Coding Assistance
Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
∗ Corresponding author.
alessandro.benetti@prometeia.com (A. Benetti); michele.filannino@prometeia.com (M. Filannino)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
1. Introduction
The advent of Generative Artificial Intelligence (GAI) is transforming our approach to creativity and the production of new content. GAI encompasses machine learning algorithms capable of generating content, ranging from images, videos, and text to music, that mirrors the style and quality of human-created works.
Recent breakthroughs in deep learning have given rise to sophisticated GAI models, such as latent diffusion models [1] and Generative Pre-trained Transformers (GPT) [2]. These models, capable of producing realistic and varied content with minimal human oversight, are trained on extensive datasets and generate new items by sampling from a learned probability distribution.
GAI’s potential is vast, with applications including the creation of lifelike virtual imagery (e.g., DALL-E 3 [3], Midjourney [4], Stable Diffusion [5]) and efficient writing assistants or conversational agents (e.g., ChatGPT [6], LLAMA [7], Gemini [8], Claude [9]). However, the rapid adoption of GAI technologies necessitates careful consideration of their ethical and responsible use, particularly in light of significant ethical and legal challenges such as intellectual property rights, privacy issues, and the potential for misuse of GAI-generated content.
2. GitHub Copilot
GitHub Copilot distinguishes itself as an innovative application of GAI, offering substantial assistance to developers in coding tasks. It is based on a GPT-3.5 model, which is part of the advanced Generative Pre-trained Transformer series. This model leverages the Transformer architecture [10], renowned for its efficacy in processing and generating text that closely resembles human writing. GPT-3.5’s capabilities extend to a comprehensive understanding of language subtleties, contextual nuances, and notably, programming code syntax.
GitHub Copilot’s primary objective is to enable programmers to concentrate on problem-solving instead of searching for the appropriate libraries and functions to implement their desired solution. With this tool, programmers can unlock a new level of productivity and efficiency and deliver high-quality code in a fraction of the time that it typically takes.
The primary capabilities of GitHub Copilot can be summarized as follows:
• Natural language interface: Developers can interact with GitHub Copilot using natural language commands. This means they can describe what they want to achieve in plain English, and GitHub Copilot will suggest code to accomplish the task.
• Integrated with IDEs: GitHub Copilot is integrated with popular code editors and IDEs, including Visual Studio Code, Visual Studio MSDN, and PyCharm.
• Context-aware: GitHub Copilot analyzes the context of the code being written and generates suggestions accordingly.
• Privacy-focused: GitHub Copilot for Business does not retain telemetry or code snippet data.
While GitHub Copilot can be a powerful tool for developers, it is important to underline some of the potential concerns that are also somewhat common to most Large Language Models:
• Potentially inaccurate code: one potential concern with GitHub Copilot is that it may generate
incorrect or non-functional code.
• Limited world or codebase knowledge after the training date: this might cause the suggestion of deprecated methods for libraries that change significantly over time.
• No match with the information that the programmer has: this is true for both the overall context of the code that it is suggesting, and some intrinsic knowledge about the world that the programmer has, like awareness among other things.
To address these concerns, it is important for developers to carefully review and test the code generated by GitHub Copilot.
3. Methodology
With the advent of this groundbreaking technology, it is crucial to thoroughly evaluate its potential through extensive testing. At Prometeia, a software development-focused consulting firm, we decided to embark on a pilot study aimed explicitly at evaluating the functionalities of GitHub Copilot.
We chose GitHub Copilot Business over alternatives like Tabnine, Blackbox, and Sourcery due to its wide range of supported programming languages, compatibility with various Integrated Development Environments (IDEs), and advanced features that meet enterprise standards, including scalability, security, and compliance.
Various reviews and studies have already examined the tool, including those by Vaithilingam [11] and the GitHub Copilot study [12]. However, these investigations have occasionally encountered contradictory findings and have not specifically concentrated on the implementation of this tool within a real-world corporate environment.
Undertaking a pilot study offers numerous advantages, making it a strategic approach for our evaluation process. Firstly, the pilot allows our developers to assess the tool’s effectiveness by testing it on a small scale. This provides an opportunity to gauge how well GitHub Copilot can assist in achieving their objectives, determine if the generated code meets their requirements, and assess if it improves their current development process. Secondly, a pilot study can assist us in identifying any limitations or potential issues with GitHub Copilot, such as difficulties with specific programming languages or complex coding tasks. By identifying such limitations early on, we can avoid potential problems and find alternatives or workarounds to using the tool, thus saving time and money in the long run. Hence, this initiative aimed to determine whether GitHub Copilot would be a viable addition to our software development toolkit.
To compare our findings with other studies, we included most of the key performance indicators (KPIs) tracked in the previous ones. Therefore, we decided to adopt the SPACE framework (Forsgren, 2021 [13]), which focuses on various aspects of developer productivity, ranging from overall individual satisfaction to knowledge sharing among different individuals. A summary of these questions can be found at the following url.
3.1. Participants Selection
For this study, we selected 31 participants from three specialized branches within our company, in particular:
• Branch A, a development team of a long-standing software solution, working on both new features and the maintenance of pre-existing ones.
• Branch B, focused mostly on the development of a new software product.
• Branch C, the development team of a software cloud product, engaged with both development of new features and maintenance.
These participants were selected due to their involvement in a broad range of projects, encompassing both innovative and established (legacy) projects. To promote an unbiased evaluation, we refrained from assigning predetermined tasks, allowing participants to incorporate the tool into their regular workflow. Over a two-month observation period, we monitored their usage of GitHub Copilot, aiming to capture its utility across diverse project types and user experiences. Notably, all of the selected participants had at least 1 year of programming experience.
In order to gather participant feedback, we organized a series of in-person meetings, offering a forum for them to share their experiences with GitHub Copilot. Based on the insights gained during these discussions, we crafted a 16-question survey covering the SPACE framework dimensions (available at the following url). This survey mixed questions from existing research with new ones specifically designed for this study, including both closed and open-ended questions. The closed-ended questions aimed to collect quantitative data, while the open-ended ones sought to capture more nuanced feedback on their experiences. This approach aimed to collect quantitative and qualitative data to comprehensively evaluate GitHub Copilot’s performance.
4. Results & Discussion
This section presents some of the key findings obtained from the analysis of participant responses. It is divided into three parts:
• overall ratings: Participants were asked to rate GitHub Copilot on a scale from 1 to 10, with 10 being the highest score. This rating serves as an overall assessment of GitHub Copilot’s performance and effectiveness.
• main benefits: This section highlights the areas where GitHub Copilot excels. It examines the specific aspects or functionalities of the tool that participants found most valuable or beneficial in their programming tasks.
• main drawbacks: In this part we explore the challenges or limitations experienced by participants when using GitHub Copilot. It focuses on the areas where the tool may struggle or encounter difficulties, as reported by the participants.
In the following sections, we will delve deeper into these areas to comprehensively analyse the findings.
4.1. Overall feedback
The evaluation of GitHub Copilot’s efficacy, as illustrated in Figure 1, reveals an overall positive reception, with a computed mean rating of 7.4 on a scale where higher values denote greater approval. This overarching assessment, however, masks underlying variations in user satisfaction that are closely linked to the specific programming language in use by the developers.
Figure 1: Overall rating given to GitHub Copilot, by programming language
A more granular analysis of the data shows that developers employing high-level programming languages, notably Python and Java, tend to assign higher ratings to GitHub Copilot. This distinction suggests a potential correlation between the nature of the programming language and the tool’s performance. High-level languages, characterized by their abstraction from machine languages and emphasis on readability, may inherently facilitate more accurate and contextually relevant code suggestions by AI-based tools like GitHub Copilot. Additionally, the extensive availability of open-source code in these languages may provide a richer dataset for the AI’s learning algorithms, enhancing its predictive accuracy and relevance. Conversely, the analysis reveals a modest decline in satisfaction among C# developers, with a mean score of 7. This discrepancy hints at possible limitations in GitHub Copilot’s adaptability or efficiency across different programming environments. The factors contributing to this variation could range from the structural and syntactical idiosyncrasies of C# that challenge the AI’s prediction models, to a potentially lesser volume of training data derived from C# codebases.
These insights advocate for a more nuanced approach to the continuous development and refinement of GitHub Copilot, emphasizing the need for language-specific optimizations to cater to the diverse requirements of the development community. For users, the findings highlight the importance of aligning expectations with the capabilities and limitations of AI tools within specific programming contexts.
4.2. Main benefits
The study’s findings, as visualized in Figure 2, delineate the multifaceted benefits that GitHub Copilot offers to developers, underscoring its impact on productivity and code quality.
Figure 2: GitHub Copilot usefulness on different tasks
One of the principal advantages identified by participants is Copilot’s proficiency in auto-generating boilerplate code and foundational code structures. This feature significantly reduces the time and effort required during the initial phases of project setup, allowing developers to bypass the tedium of crafting repetitive code patterns from scratch. Such efficiency in establishing project infrastructure is not only a time-saver but also enables a smoother transition to more complex development tasks.
Moreover, GitHub Copilot’s contribution to code documentation represents another vital benefit. The tool’s ability to furnish quick and precise descriptions for functions, classes, and various code segments assists developers in maintaining well-documented codebases. Proper documentation is crucial for enhancing code readability, facilitating easier maintenance, and enabling smoother collaboration among team members. By automating this aspect, Copilot aids in ensuring that projects adhere to best practices in code documentation, thus elevating the overall quality of the development process.
The generation of test code for existing functions by GitHub Copilot is highlighted as a particularly advantageous feature. This capability assists developers in creating comprehensive test suites, a critical component of the software development life cycle aimed at verifying the correctness and reliability of code. Notably, we advised the developers to exercise increased caution when incorporating tests authored by GitHub Copilot, given their significant influence on the code’s overall reliability.
An interesting observation from the study is the strategic utilization of time saved through GitHub Copilot’s assistance. Many participants reported reallocating the time gained to further enhance the quality of their products by focusing on rigorous testing, refining documentation, or dedicating effort to areas of the project that could benefit from manual oversight. Alternatively, some participants chose to invest the saved time into personal development, such as exploring new programming libraries, learning new tools, or contributing to other projects. This flexibility underscores Copilot’s role not just as a tool for immediate productivity gains but also as an enabler for broader professional growth and product quality enhancement.
4.3. Main drawbacks
While GitHub Copilot has been lauded for its ability to enhance developer productivity and streamline workflows, the tool is not without its limitations, which can impact its overall effectiveness in certain contexts.
One notable concern is its integration with Integrated Development Environments (IDEs), particularly when used alongside other coding aids such as IntelliSense. Some users have reported conflicts between GitHub Copilot and IntelliSense, leading to potential confusion and errors. This issue underscores the importance of seamless tool integration within the development environment to prevent disruption in the coding process.
Another drawback observed by users is the tool’s tendency to suggest repetitive code. Such suggestions can potentially lead to less efficient or less elegant coding solutions, contradicting the tool’s aim to streamline development efforts. This behavior might stem from the AI’s training data or its current understanding of best coding practices, indicating an area for further refinement to ensure that Copilot consistently proposes high-quality and contextually appropriate code.
Figure 3: Percentage of code lines written by GitHub Copilot
The effectiveness of GitHub Copilot appears to vary significantly when dealing with different types of codebases. Specifically, its performance with large or legacy codebases presents challenges, as evidenced by a reported median contribution of merely 10% (Figure 3) to the lines of code written in such contexts. This reduced effectiveness could be attributed to the AI’s limited ability to fully comprehend the complexities and nuances of older
or more extensive codebases, leading to challenges in generating accurate and useful code suggestions.
Conversely, GitHub Copilot demonstrates considerably greater efficiency with new codebases, where it contributes to around 30% of the written code. This discrepancy highlights Copilot’s aptitude for aiding in the rapid development of new projects, where its capabilities in generating boilerplate code and structuring new projects can be most beneficial.
5. Related works
The assessment of Artificial Intelligence (AI) tools’ impact on various sectors, particularly in software development, has been an area of interest in recent years. The advent of AI innovations has been consistently associated with enhancements in productivity levels and the facilitation of a more intuitive process for coding, as documented in the findings of Chen et al. (2021) [14], who illustrated the positive ramifications of AI on software engineering practices.
In the realm of code generation, the deployment of deep learning methodologies, especially those utilizing Transformer models, has been met with considerable success. A noteworthy illustration of this is the study by Feng et al. (2020) [15], which presents a model that significantly surpasses the efficiency of conventional code completion tools. This approach, which harnesses the power of deep learning to achieve a contextual comprehension and prediction, showcases the potential of generative AI to navigate and replicate complex coding idioms and patterns with remarkable accuracy.
Furthermore, research conducted by Yetistiren et al. (2022) [16] denotes the proficiency of GitHub Copilot in understanding coding syntax. This underlines the broad spectrum of advantages offered by AI in the realm of software development, extending beyond mere procedural improvements to include significant qualitative enhancements in code management and optimization.
Despite the proliferation of studies and reviews in this area, it is crucial to acknowledge the predominance of in vitro research methodologies (where specific programming tasks are assigned to participants) and the occasional emergence of conflicting findings, as highlighted by Vaithilingam (2022) [11] and the study on GitHub Copilot (2022) [12]. These discrepancies underscore the necessity for our own comprehensive in vivo study (in which no specific programming tasks are prescribed), involving a diverse array of developers from various corporate sectors, employing different Integrated Development Environments (IDEs) and programming languages.
5.1. Comparison with other studies
To further understand the impact of GitHub Copilot, we incorporated a key performance indicator (KPI) from a GitHub Copilot survey [12] for a direct comparison with our study’s outcomes. Our findings, reported in Table 1, showed notable differences from GitHub’s reported results. Although our results still reflect a highly positive sentiment, it is important to note that the differences may be attributed to the nature of the experiments conducted by GitHub. Specifically, our study population consisted of individuals with impending deadlines, which could influence their perceptions and experiences with the tool.
Question                          GitHub Overall  Prometeia Overall  Prometeia Branch A  Prometeia Branch B  Prometeia Branch C
Focus on more satisfying work     74%             35%                36%                 44%                 27%
Feel more productive              88%             32%                18%                 44%                 36%
Are faster with repetitive tasks  96%             74%                82%                 78%                 63%
Table 1: GitHub study performance metrics
The variations in percentages observed between the groups can be ascribed to various factors, such as the programming languages employed by developers and the nature of the projects they were engaged in. For example, diverse branches may adopt distinct programming practices, preferences, and project requirements, which can shape their views and usage of GitHub Copilot. Additionally, the nature of the projects (whether new or legacy) can also impact how the benefits and drawbacks of GitHub Copilot are perceived. These factors highlight the significance of considering the context in which the tool is employed and comprehending its potential influence on the recorded percentages.
6. Conclusions
The study revealed a generally positive overall rating for GitHub Copilot, with developers giving it an average score of 7.4 (out of 10) and, as expected, ratings varied based on the programming language used. Our developers identified several benefits of using GitHub Copilot. One of its main advantages is the ability to generate boilerplate and basic code structures, saving developers time and effort during project setup. Additionally, the tool ensures proper code documentation by providing accurate descriptions of functions, classes, and other code elements. Another notable benefit is its capability to generate test code for existing functions, contributing to code reliability. Consistent with its original goal, this study proved GitHub Copilot effective in supporting our software development activities. As a result, various branches of our company started using the tool as part of
their standard development toolkit. Lastly, we could not evaluate the tool on junior programmers, which leaves an area of inquiry for future studies. Understanding how newcomers to the field, with potentially different learning curves and development practices, interact with and benefit from GitHub Copilot could provide valuable insights into its overall utility and areas for improvement.
Acknowledgments
The authors wish to express their sincere gratitude to Dave Burnison from GitHub for his feedback on the Copilot service. His insights and expertise have significantly contributed to the research and understanding of the impact of AI-assisted programming tools in software development.
References
[1] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-resolution image synthesis with latent diffusion models, 2022. arXiv:2112.10752.
[2] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, 2020. arXiv:2005.14165.
[3] J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al., Improving image generation with better captions, Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf 2 (2023) 8.
[4] Midjourney, https://www.midjourney.com/home, 2024. Accessed: March 25, 2024.
[5] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, et al., Scaling rectified flow transformers for high-resolution image synthesis, arXiv preprint arXiv:2403.03206 (2024).
[6] ChatGPT, https://openai.com/blog/chatgpt, 2024. Accessed: April 4, 2024.
[7] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, Llama: Open and efficient foundation language models, 2023. arXiv:2302.13971.
[8] Gemini-Team, Gemini: A family of highly capable multimodal models, 2024. arXiv:2312.11805.
[9] Claude, https://www.anthropic.com/claude, 2024. Accessed: April 4, 2024.
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2023. arXiv:1706.03762.
[11] P. Vaithilingam, T. Zhang, E. L. Glassman, Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models, CHI EA ’22, Association for Computing Machinery, New York, NY, USA, 2022. URL: https://doi.org/10.1145/3491101.3519665. doi:10.1145/3491101.3519665.
[12] Research: Quantifying GitHub Copilot’s impact on developer productivity and happiness, 2022. Accessed: 2024-03-13. GitHub Copilot Study.
[13] N. Forsgren, M.-A. Storey, C. Maddila, T. Zimmermann, B. Houck, J. Butler, The SPACE of developer productivity: There’s more to it than you think, Queue 19 (2021) 20–48. doi:10.1145/3454122.3454124.
[14] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, W. Zaremba, Evaluating large language models trained on code, 2021. arXiv:2107.03374.
[15] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, M. Zhou, CodeBERT: A pre-trained model for programming and natural languages, 2020. arXiv:2002.08155.
[16] B. Yetistiren, I. Ozsoy, E. Tuzun, Assessing the quality of GitHub Copilot’s code generation, in: Proceedings of the 18th international conference on predictive models and data analytics in software engineering, 2022, pp. 62–71.
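As a supplementary illustration of the comparison reported in Table 1, the following minimal Python sketch restates the table's percentages and derives two quantities that the discussion describes only qualitatively: the gap between GitHub's reported figures and the Prometeia overall figures, and the spread across the three branches. Only the percentages come from Table 1; the dictionary layout and the function names `gap_to_github` and `branch_spread` are assumptions made for illustration and are not part of the study's tooling.

```python
# Percentages transcribed from Table 1 (share of respondents agreeing).
# The data layout and helper names below are illustrative assumptions.
TABLE_1 = {
    "Focus on more satisfying work": {
        "GitHub": 74, "Prometeia": 35,
        "Branch A": 36, "Branch B": 44, "Branch C": 27,
    },
    "Feel more productive": {
        "GitHub": 88, "Prometeia": 32,
        "Branch A": 18, "Branch B": 44, "Branch C": 36,
    },
    "Are faster with repetitive tasks": {
        "GitHub": 96, "Prometeia": 74,
        "Branch A": 82, "Branch B": 78, "Branch C": 63,
    },
}

BRANCHES = ("Branch A", "Branch B", "Branch C")


def gap_to_github(table):
    """Per question: GitHub overall minus Prometeia overall, in percentage points."""
    return {q: row["GitHub"] - row["Prometeia"] for q, row in table.items()}


def branch_spread(table):
    """Per question: highest branch value minus lowest branch value, in percentage points."""
    return {
        q: max(row[b] for b in BRANCHES) - min(row[b] for b in BRANCHES)
        for q, row in table.items()
    }


if __name__ == "__main__":
    gaps = gap_to_github(TABLE_1)
    spreads = branch_spread(TABLE_1)
    for question in TABLE_1:
        print(f"{question}: gap {gaps[question]} pp, branch spread {spreads[question]} pp")
```

For the "Feel more productive" row, for instance, the gap is 56 percentage points (88% versus 32%) and the branch spread is 26 percentage points (44% versus 18%). These are plain differences of the reported percentages and carry no statistical weight on their own.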