<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Ital-IA</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Copilot: a systematic study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Benetti</string-name>
          <email>alessandro.benetti@prometeia.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michele Filannino</string-name>
          <email>michele.filannino@prometeia.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Prometeia, Piazza Trento e Trieste</institution>
          ,
          <addr-line>30 - Bologna, 40137</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>4</volume>
      <fpage>29</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>This paper examines the effects of GitHub Copilot, a prominent example of generative artificial intelligence (GAI), on software development methodologies. Through an empirical study of GitHub Copilot's performance in a professional setting, we assess its value across various programming environments. Our evaluation reveals that GitHub Copilot significantly improves developer productivity and provides useful assistance in different coding scenarios. Furthermore, the research outlines effective strategies for leveraging GitHub Copilot to its fullest potential, thus advancing the use of GAI tools in software engineering. While recognizing GitHub Copilot's considerable advantages, we also identify its shortcomings and areas in need of further improvement.</p>
      </abstract>
      <kwd-group>
        <kwd>Generative AI</kwd>
        <kwd>Software Engineering</kwd>
        <kwd>GitHub</kwd>
        <kwd>Software Development</kwd>
        <kwd>Systematic Study</kwd>
        <kwd>GAI</kwd>
        <kwd>Coding Assistance</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>The advent of Generative Artificial Intelligence (GAI) is transforming our approach to creativity and the production of new content. GAI encompasses machine learning algorithms capable of generating content (ranging from images, videos, and text to music) that mirrors the style and quality of human-created works.</p>
      <p>Recent breakthroughs in deep learning have given rise to sophisticated GAI models, such as latent diffusion models [1] and Generative Pre-trained Transformers (GPT) [2]. These models, capable of producing realistic and varied content with minimal human oversight, are trained on extensive datasets and generate new items by sampling from a learned probability distribution.</p>
      <p>GAI’s potential is vast, with applications including the creation of lifelike virtual imagery (e.g., DALL-E 3 [3], Midjourney [4], Stable Diffusion [5]) and efficient writing assistants or conversational agents (e.g., ChatGPT [6], LLaMA [7], Gemini [8], Claude [9]). However, the rapid adoption of GAI technologies necessitates careful consideration of their ethical and responsible use, particularly in light of significant ethical and legal challenges such as intellectual property rights, privacy issues, and the potential for misuse of GAI-generated content.</p>
    </sec>
    <sec id="sec-3">
      <title>2. GitHub Copilot</title>
      <p>GitHub Copilot distinguishes itself as an innovative application of GAI, offering substantial assistance to developers in coding tasks. It is based on a GPT-3.5 model, which is capable of generating text that closely resembles human writing. Its main characteristics are:</p>
      <p>• Integrated with IDEs: GitHub Copilot is integrated with several IDEs, including Visual Studio Code, Visual Studio, and PyCharm.
• Context-aware: GitHub Copilot analyzes the context of the code being written and generates suggestions accordingly.
• Privacy-focused: GitHub Copilot for Business does not retain telemetry or code snippets data.</p>
      <p>While GitHub Copilot can be a powerful tool for developers, it is important to underline some potential concerns, most of which are common to Large Language Models in general:</p>
      <p>• Potentially inaccurate code: GitHub Copilot may generate incorrect or non-functional code.
• Limited world or codebase knowledge after the training date: this may lead it to suggest deprecated methods for libraries that change significantly over time.
• No match with the information that the programmer has: this applies both to the overall context of the code being suggested and to intrinsic knowledge about the world that the programmer has, like awareness, among other things.</p>
      <p>To address these concerns, it is important for developers to carefully review and test the code generated by GitHub Copilot.</p>
    </sec>
    <sec id="sec-methodology">
      <title>3. Methodology</title>
      <p>With the advent of this groundbreaking technology, it is crucial to thoroughly evaluate its potential through extensive testing. At Prometeia, a software development-focused consulting firm, we decided to embark on a pilot study aimed explicitly at evaluating the functionalities of GitHub Copilot.</p>
      <p>We chose GitHub Copilot Business over alternatives like Tabnine, Blackbox, and Sourcery due to its wide range of supported programming languages, compatibility with various Integrated Development Environments (IDEs), and advanced features that meet enterprise standards, including scalability, security, and compliance.</p>
      <p>Various reviews and studies have already evaluated the tool, including those by Vaithilingam [11] and the GitHub Copilot study [12]. However, these investigations have occasionally reported contradictory findings and have not specifically concentrated on the use of this tool within a real-world corporate environment.</p>
      <p>Undertaking a pilot study offers numerous advantages, making it a strategic approach for our evaluation process. Firstly, the pilot allows our developers to assess the tool’s effectiveness by testing it on a small scale. This provides an opportunity to gauge how well GitHub Copilot can assist in achieving their objectives, determine if the generated code meets their requirements, and assess if it improves their current development process. Secondly, a pilot study can help us identify any limitations or potential issues with GitHub Copilot, such as difficulties with specific programming languages or complex coding tasks. By identifying such limitations early on, we can avoid potential problems and find alternatives or workarounds to using the tool, thus saving time and money in the long run. Hence, this initiative aimed to determine whether GitHub Copilot would be a viable addition to our software development toolkit.</p>
      <p>To compare our findings with other studies, we included most of the key performance indicators (KPIs) tracked in the previous ones. Therefore, we decided to adopt the SPACE framework (Forsgren et al., 2021 [13]), which focuses on various aspects of developer productivity, ranging from overall individual satisfaction to knowledge sharing among different individuals. A summary of these questions can be found at the following url.</p>
      <sec id="sec-methodology-participants">
        <title>3.1. Participants Selection</title>
        <p>For this study, we selected 31 participants from three specialized branches within our company, in particular:</p>
        <p>• Branch A, a development team of a longstanding software solution, working on both new features and the maintenance of pre-existing ones.
• Branch B, focused mostly on the development of a new software product.
• Branch C, the development team of a software cloud product, engaged with both development of new features and maintenance.</p>
        <p>These participants were selected due to their involvement in a broad range of projects, encompassing both innovative and established (legacy) projects. To promote an unbiased evaluation, we refrained from assigning predetermined tasks, allowing participants to incorporate the tool into their regular workflow. Over a two-month observation period, we monitored their usage of GitHub Copilot, aiming to capture its utility across diverse project types and user experiences. Notably, all of the selected participants had at least 1 year of programming experience.</p>
        <p>In order to gather participant feedback, we organized a series of in-person meetings, offering a forum for them to share their experiences with GitHub Copilot. Based on the insights gained during these discussions, we crafted a 16-question survey covering the SPACE framework dimensions (available at the following url). This survey mixed questions from existing research with new ones specifically designed for this study, including both closed and open-ended questions. The closed-ended questions aimed to collect quantitative data, while the open-ended ones sought to capture more nuanced feedback on the participants’ experiences. Together, the quantitative and qualitative data allow a comprehensive evaluation of GitHub Copilot’s performance.</p>
      </sec>
    </sec>
    <sec id="sec-results">
      <title>4. Results &amp; Discussion</title>
      <p>The following section presents some of the key findings obtained from the analysis of participant responses. This section is divided into three parts:</p>
      <p>• Overall ratings: Participants were asked to rate GitHub Copilot on a scale from 1 to 10, with 10 being the highest score. This rating serves as an overall assessment of GitHub Copilot’s performance and effectiveness.
• Main benefits: This section highlights the areas where GitHub Copilot excels. It examines the specific aspects or functionalities of the tool that participants found most valuable or beneficial in their programming tasks.
• Main drawbacks: In this part we explore the challenges or limitations experienced by participants when using GitHub Copilot. It focuses on the areas where the tool may struggle or encounter difficulties, as reported by the participants.</p>
      <p>In the following sections, we will delve deeper into these areas to comprehensively analyse the findings.</p>
      <sec id="sec-results-overall">
        <title>4.1. Overall feedback</title>
        <p>The evaluation of GitHub Copilot’s efficacy, as illustrated in Figure 1, reveals an overall positive reception, with a computed mean rating of 7.4 on a scale where higher values denote greater approval. This overarching assessment, however, masks underlying variations in user satisfaction that are closely linked to the specific programming language in use by the developers.</p>
        <p>Figure 1: Overall rating given to GitHub Copilot, by programming language.</p>
        <p>A more granular analysis of the data shows that developers employing high-level programming languages, notably Python and Java, tend to assign higher ratings to GitHub Copilot. This distinction suggests a potential correlation between the nature of the programming language and the tool’s performance. High-level languages, characterized by their abstraction from machine languages and emphasis on readability, may inherently facilitate more accurate and contextually relevant code suggestions by AI-based tools like GitHub Copilot. Additionally, the extensive availability of open-source code in these languages may provide a richer dataset for the AI’s learning algorithms, enhancing its predictive accuracy and relevance. Conversely, the analysis reveals a modest decline in satisfaction among C# developers, with a mean score of 7. This discrepancy hints at possible limitations in GitHub Copilot’s adaptability or efficiency across different programming environments. The factors contributing to this variation could range from the structural and syntactical idiosyncrasies of C# that challenge the AI’s prediction models, to a potentially smaller volume of training data derived from C# codebases.</p>
        <p>These insights advocate for a more nuanced approach to the continuous development and refinement of GitHub Copilot, emphasizing the need for language-specific optimizations to cater to the diverse requirements of the development community. For users, the findings highlight the importance of aligning expectations with the capabilities and limitations of AI tools within specific programming contexts.</p>
      </sec>
      <sec id="sec-results-benefits">
        <title>4.2. Main benefits</title>
        <p>The study’s findings, as visualized in Figure 2, delineate the multifaceted benefits that GitHub Copilot offers to developers, underscoring its impact on productivity and code quality.</p>
        <p>One of the principal advantages identified by participants is Copilot’s proficiency in auto-generating boilerplate code and foundational code structures. This feature significantly reduces the time and effort required during the initial phases of project setup, allowing developers to bypass the tedium of crafting repetitive code patterns from scratch. Such efficiency in establishing project infrastructure is not only a time-saver but also enables a smoother transition to more complex development tasks.</p>
        <p>Moreover, GitHub Copilot’s contribution to code documentation represents another vital benefit. The tool’s ability to furnish quick and precise descriptions for functions, classes, and various code segments assists developers in maintaining well-documented codebases. Proper documentation is crucial for enhancing code readability, facilitating easier maintenance, and enabling smoother collaboration among team members. By automating this aspect, Copilot helps projects adhere to best practices in code documentation, thus elevating the overall quality of the development process.</p>
        <p>The generation of test code for existing functions by GitHub Copilot is highlighted as a particularly advantageous feature. This capability assists developers in creating comprehensive test suites, a critical component of the software development life cycle aimed at verifying the correctness and reliability of code. Notably, we advised the developers to exercise increased caution when incorporating tests authored by GitHub Copilot, given their significant influence on the code’s overall reliability.</p>
        <p>An interesting observation from the study is the strategic utilization of time saved through GitHub Copilot’s assistance. Many participants reported reallocating the time gained to further enhance the quality of their products by focusing on rigorous testing, refining documentation, or dedicating effort to areas of the project that could benefit from manual oversight. Alternatively, some participants chose to invest the saved time into personal development, such as exploring new programming libraries, learning new tools, or contributing to other projects. This flexibility underscores Copilot’s role not just as a tool for immediate productivity gains but also as an enabler for broader professional growth and product quality enhancement.</p>
      </sec>
      <sec id="sec-results-drawbacks">
        <title>4.3. Main drawbacks</title>
        <p>While GitHub Copilot has been lauded for its ability to enhance developer productivity and streamline workflows, the tool is not without its limitations, which can impact its overall effectiveness in certain contexts.</p>
        <p>One notable concern is its integration with Integrated Development Environments (IDEs), particularly when used alongside other coding aids such as IntelliSense. Some users have reported conflicts between GitHub Copilot and IntelliSense, leading to potential confusion and errors. This issue underscores the importance of seamless tool integration within the development environment to prevent disruption in the coding process.</p>
        <p>Another drawback observed by users is the tool’s tendency to suggest repetitive code. Such suggestions can potentially lead to less efficient or elegant coding solutions, contradicting the tool’s aim to streamline development efforts. This behavior might stem from the AI’s training data or its current understanding of best coding practices, indicating an area for further refinement to ensure that Copilot consistently proposes high-quality and contextually appropriate code.</p>
        <p>The effectiveness of GitHub Copilot appears to vary significantly when dealing with different types of codebases. Specifically, its performance with large or legacy codebases presents challenges, as evidenced by a reported median contribution of merely 10% (Figure 3) to the lines of code written in such contexts. This reduced effectiveness could be attributed to the AI’s limited ability to fully comprehend the complexities and nuances of older or more extensive codebases, leading to challenges in generating accurate and useful code suggestions.</p>
        <p>Conversely, GitHub Copilot demonstrates considerably greater efficiency with new codebases, where it contributes to around 30% of the written code. This discrepancy highlights Copilot’s aptitude for aiding in the rapid development of new projects, where its capabilities in generating boilerplate code and structuring new projects can be most beneficial.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Related works</title>
      <p>The assessment of the impact of Artificial Intelligence (AI) tools on various sectors, particularly software development, has been an area of interest in recent years. The advent of AI innovations has been consistently associated with enhancements in productivity levels and the facilitation of a more intuitive process for coding, as documented in the findings of Chen et al. (2021) [14], who illustrated the positive ramifications of AI on software engineering practices.</p>
      <p>In the realm of code generation, the deployment of deep learning methodologies, especially those utilizing Transformer models, has been met with considerable success. A noteworthy illustration of this is the study by Feng et al. (2020) [15], which presents a model that significantly surpasses the efficiency of conventional code completion tools. This approach, which harnesses the power of deep learning to achieve contextual comprehension and prediction, showcases the potential of generative AI to navigate and replicate complex coding idioms and patterns with remarkable accuracy.</p>
      <p>Furthermore, research conducted by Yetistiren et al. (2022) [16] denotes the proficiency of GitHub Copilot in understanding coding syntax. This underlines the broad spectrum of advantages offered by AI in the realm of software development, extending beyond mere procedural improvements to include significant qualitative enhancements in code management and optimization.</p>
      <p>Despite the proliferation of studies and reviews in this area, it is crucial to acknowledge the predominance of in vitro research methodologies (where specific programming tasks are assigned to participants) and the occasional emergence of conflicting findings, as highlighted by Vaithilingam (2022) [11] and the study on GitHub Copilot (2021) [12]. These discrepancies underscore the necessity for our own comprehensive in vivo study (in which no specific programming tasks are prescribed), involving a diverse array of developers from various corporate sectors, employing different Integrated Development Environments (IDEs) and programming languages.</p>
      <p>To further understand the impact of GitHub Copilot, we incorporated a key performance indicator (KPI) from a GitHub Copilot survey [12] for a direct comparison with our study’s outcomes. Our findings, reported in Table 1, showed notable differences from GitHub’s reported results. Although our results still reflect a highly positive sentiment, it is important to note that the differences may be attributed to the nature of the experiments conducted by GitHub. Specifically, our study population consisted of individuals with impending deadlines, which could influence their perceptions and experiences with the tool.</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption>
          <p>GitHub study performance metrics, compared with the results of our study.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Question</th>
              <th>GitHub (overall)</th>
              <th>Prometeia (overall)</th>
              <th>Prometeia (Branch A)</th>
              <th>Prometeia (Branch B)</th>
              <th>Prometeia (Branch C)</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Focus on more satisfying work</td>
              <td>74%</td>
              <td>35%</td>
              <td>36%</td>
              <td>44%</td>
              <td>27%</td>
            </tr>
            <tr>
              <td>Feel more productive</td>
              <td>88%</td>
              <td>32%</td>
              <td>18%</td>
              <td>44%</td>
              <td>36%</td>
            </tr>
            <tr>
              <td>Are faster with repetitive tasks</td>
              <td>96%</td>
              <td>74%</td>
              <td>82%</td>
              <td>78%</td>
              <td>63%</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The variations in percentages observed between the groups can be ascribed to various factors, such as the programming languages employed by developers and the nature of the projects they were engaged in. For example, different branches may adopt distinct programming practices, preferences, and project requirements, which can shape their views and usage of GitHub Copilot. Additionally, the nature of the projects (whether new or legacy) can also impact how the benefits and drawbacks of GitHub Copilot are perceived. These factors highlight the significance of considering the context in which the tool is employed and comprehending its potential influence on the recorded percentages.</p>
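      <p>As a rough illustration of how the comparison in Table 1 can be read, the following Python sketch computes, for each question, the gap in percentage points between GitHub's reported figure and our overall figure, together with the spread across the three branches. Only the percentages come from Table 1; the dictionary layout, the key names, and the summarize helper are assumptions introduced for this example and are not part of the study's tooling.</p>
      <preformat>
```python
# Percentages transcribed from Table 1.
# The dictionary layout and key names are illustrative assumptions.
TABLE_1 = {
    "Focus on more satisfying work": {
        "github": 74, "overall": 35,
        "branch_a": 36, "branch_b": 44, "branch_c": 27,
    },
    "Feel more productive": {
        "github": 88, "overall": 32,
        "branch_a": 18, "branch_b": 44, "branch_c": 36,
    },
    "Are faster with repetitive tasks": {
        "github": 96, "overall": 74,
        "branch_a": 82, "branch_b": 78, "branch_c": 63,
    },
}

BRANCHES = ("branch_a", "branch_b", "branch_c")


def summarize(row):
    """Return the gap to GitHub's figure and the spread across branches.

    Both values are expressed in percentage points. This helper is a
    hypothetical convenience function, not something used in the study.
    """
    branch_values = [row[name] for name in BRANCHES]
    return {
        "gap_to_github": row["github"] - row["overall"],
        "branch_spread": max(branch_values) - min(branch_values),
    }


# One summary entry per survey question.
SUMMARY = {question: summarize(row) for question, row in TABLE_1.items()}

if __name__ == "__main__":
    for question, stats in SUMMARY.items():
        print(f"{question}: gap {stats['gap_to_github']} pp, "
              f"branch spread {stats['branch_spread']} pp")
```
      </preformat>
      <p>Read this way, the perceived-productivity question shows the widest gap from GitHub's figure (56 percentage points) and the repetitive-task question the narrowest (22 percentage points), which is consistent with the context-dependent interpretation given above.</p>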
    </sec>
    <sec id="sec-conclusions">
      <title>6. Conclusions</title>
      <p>The study revealed a generally positive overall rating for GitHub Copilot, with developers giving it an average score of 7.4 (out of 10); as expected, ratings varied based on the programming language used. Our developers identified several benefits of using GitHub Copilot. One of its main advantages is the ability to generate boilerplate and basic code structures, saving developers time and effort during project setup. Additionally, the tool supports proper code documentation by providing accurate descriptions of functions, classes, and other code elements. Another notable benefit is its capability to generate test code for existing functions, contributing to code reliability. Consistent with its original goal, this study found GitHub Copilot effective in supporting our software development activities. As a result, various branches of our company started using the tool as part of their standard development toolkit.</p>
      <p>Lastly, we could not evaluate the tool on junior programmers, which leaves an area of inquiry for future studies. Understanding how newcomers to the field, with potentially different learning curves and development practices, interact with and benefit from GitHub Copilot could provide valuable insights into its overall utility and areas for improvement.</p>
    </sec>
  </body>
  <back>
    <ack>
      <title>Acknowledgments</title>
      <p>The authors wish to express their sincere gratitude to Dave Burnison from GitHub for his feedback on the Copilot service. His insights and expertise have significantly contributed to the research and understanding of the impact of AI-assisted programming tools in software development.</p>
    </ack>
    <ref-list>
      <title>References</title>
      <ref id="ref-1">
        <label>[1]</label>
        <mixed-citation>R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-resolution image synthesis with latent diffusion models, 2022. arXiv:2112.10752.</mixed-citation>
      </ref>
      <ref id="ref-2">
        <label>[2]</label>
        <mixed-citation>T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, 2020. arXiv:2005.14165.</mixed-citation>
      </ref>
      <ref id="ref-3">
        <label>[3]</label>
        <mixed-citation>J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al., Improving image generation with better captions, Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf 2 (2023) 8.</mixed-citation>
      </ref>
      <ref id="ref-4">
        <label>[4]</label>
        <mixed-citation>Midjourney, https://www.midjourney.com/home, 2024. Accessed: March 25, 2024.</mixed-citation>
      </ref>
      <ref id="ref-5">
        <label>[5]</label>
        <mixed-citation>P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, et al., Scaling rectified flow transformers for high-resolution image synthesis, arXiv preprint arXiv:2403.03206 (2024).</mixed-citation>
      </ref>
      <ref id="ref-6">
        <label>[6]</label>
        <mixed-citation>ChatGPT, https://openai.com/blog/chatgpt, 2024. Accessed: April 4, 2024.</mixed-citation>
      </ref>
      <ref id="ref-7">
        <label>[7]</label>
        <mixed-citation>H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, Llama: Open and efficient foundation language models, 2023. arXiv:2302.13971.</mixed-citation>
      </ref>
      <ref id="ref-8">
        <label>[8]</label>
        <mixed-citation>Gemini-Team, Gemini: A family of highly capable multimodal models, 2024. arXiv:2312.11805.</mixed-citation>
      </ref>
      <ref id="ref-9">
        <label>[9]</label>
        <mixed-citation>Claude, https://www.anthropic.com/claude, 2024. Accessed: April 4, 2024.</mixed-citation>
      </ref>
      <ref id="ref-10">
        <label>[10]</label>
        <mixed-citation>A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2023. arXiv:1706.03762.</mixed-citation>
      </ref>
      <ref id="ref-11">
        <label>[11]</label>
        <mixed-citation>P. Vaithilingam, T. Zhang, E. L. Glassman, Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models, CHI EA ’22, Association for Computing Machinery, New York, NY, USA, 2022. URL: https://doi.org/10.1145/3491101.3519665. doi:10.1145/3491101.3519665.</mixed-citation>
      </ref>
      <ref id="ref-12">
        <label>[12]</label>
        <mixed-citation>Research: Quantifying GitHub Copilot’s impact on developer productivity and happiness, 2022. Accessed: 2024-03-13. GitHub Copilot Study.</mixed-citation>
      </ref>
      <ref id="ref-13">
        <label>[13]</label>
        <mixed-citation>N. Forsgren, M.-A. Storey, C. Maddila, T. Zimmermann, B. Houck, J. Butler, The SPACE of developer productivity: There’s more to it than you think., Queue 19 (2021) 20–48. doi:10.1145/3454122.3454124.</mixed-citation>
      </ref>
      <ref id="ref-14">
        <label>[14]</label>
        <mixed-citation>M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, W. Zaremba, Evaluating large language models trained on code, 2021. arXiv:2107.03374.</mixed-citation>
      </ref>
      <ref id="ref-15">
        <label>[15]</label>
        <mixed-citation>Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, M. Zhou, Codebert: A pre-trained model for programming and natural languages, 2020. arXiv:2002.08155.</mixed-citation>
      </ref>
      <ref id="ref-16">
        <label>[16]</label>
        <mixed-citation>B. Yetistiren, I. Ozsoy, E. Tuzun, Assessing the quality of GitHub Copilot’s code generation, in: Proceedings of the 18th international conference on predictive models and data analytics in software engineering, 2022, pp. 62–71.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>