
Ital-IA

1613-0073

Copilot: a systematic study

Alessandro Benetti

alessandro.benetti@prometeia.com

Michele Filannino

michele.filannino@prometeia.com

Keywords: Generative AI, Software Engineering, GitHub, Software Development, Systematic Study, GAI, Coding Assistance

Prometeia, Piazza Trento e Trieste, 30 - Bologna, 40137, Italy

2024

4 29 30

This paper examines the effects of GitHub Copilot, a prominent example of generative artificial intelligence (GAI), on software development methodologies. Through an empirical study of GitHub Copilot's performance in a professional setting, we assess its value across various programming environments. Our comprehensive evaluation reveals that GitHub Copilot significantly improves developer productivity and provides useful assistance across different coding scenarios. Furthermore, the research outlines effective strategies for leveraging GitHub Copilot to its fullest potential, thus advancing the use of GAI tools in software engineering. While recognizing GitHub Copilot's considerable advantages, we also identify its shortcomings and areas in need of further improvement.

∗Corresponding author.

CEUR ceur-ws.org

1. Introduction

The advent of Generative Artificial Intelligence (GAI) is transforming our approach to creativity and the production of new content. GAI encompasses machine learning algorithms capable of generating content (ranging from images, videos, and text to music) that mirrors the style and quality of human-created works.

Recent breakthroughs in deep learning have given rise to sophisticated GAI models, such as latent diffusion models [1] and Generative Pre-trained Transformers (GPT) [2]. These models, capable of producing realistic and varied content with minimal human oversight, are trained on extensive datasets and generate new items by sampling from a learned probability distribution.

GAI's potential is vast, with applications including the creation of lifelike virtual imagery (e.g., DALLE-3 [3], Midjourney [4], Stable Diffusion [5]) and efficient writing assistants or conversational agents (e.g., ChatGPT [6], LLAMA [7], Gemini [8], Claude [9]). However, the rapid adoption of GAI technologies necessitates careful consideration of their ethical and responsible use, particularly in light of significant ethical and legal challenges such as intellectual property rights, privacy issues, and the potential for misuse of GAI-generated content.

2. GitHub Copilot

GitHub Copilot distinguishes itself as an innovative application of GAI, offering substantial assistance to developers in coding tasks. It is based on a GPT-3.5 model, which [...] and generating text that closely resembles human writing. [...] the time that it typically takes. [...]

• Integrated with IDEs: GitHub Copilot is integrated [...] including Visual Studio Code, Visual Studio MSDN and PyCharm.
• Context-aware: GitHub Copilot analyzes the context of the code being written and generates suggestions accordingly.
• Privacy-focused: GitHub Copilot for Business does not retain telemetry or code snippets data.

While GitHub Copilot can be a powerful tool for developers, it is important to underline some of the potential concerns that are also somewhat common to most Large Language Models:

• Potentially inaccurate code: one potential concern with GitHub Copilot is that it may generate incorrect or non-functional code.
• Limited world or codebase knowledge after the training date: this might cause the suggestion of deprecated methods for libraries that change significantly over time.
• No match with the information that the programmer has: this is true for both the overall context of the code that it is suggesting, and some intrinsic knowledge about the world that the programmer has, like awareness among other things.

To address these concerns, it is important for developers to carefully review and test the code generated by GitHub Copilot.

3. Methodology

With the advent of this groundbreaking technology, it is crucial to thoroughly evaluate its potential through extensive testing. At Prometeia, a software development-focused consulting firm, we decided to embark on a pilot study aimed explicitly at evaluating the functionalities of GitHub Copilot.

We chose GitHub Copilot Business over alternatives like Tabnine, Blackbox, and Sourcery due to its wide range of supported programming languages, compatibility with various Integrated Development Environments (IDEs), and advanced features that meet enterprise standards, including scalability, security, and compliance.

Various reviews and studies have examined the tool, including those by Vaithilingam [11] and the GitHub Copilot study [12]. However, these investigations have occasionally encountered contradictory findings and have not specifically concentrated on the implementation of this tool within a real-world corporate environment.

Undertaking a pilot study offers numerous advantages, making it a strategic approach for our evaluation process. Firstly, the pilot allows our developers to assess the tool's effectiveness by testing it on a small scale. This provides an opportunity to gauge how well GitHub Copilot can assist in achieving their objectives, determine if the generated code meets their requirements, and assess if it improves their current development process. Secondly, a pilot study can assist us in identifying any limitations or potential issues with GitHub Copilot, such as difficulties with specific programming languages or complex coding tasks. By identifying such limitations early on, we can avoid potential problems and find alternatives or workarounds to using the tool, thus saving time and money in the long run. Hence, this initiative aimed to determine whether GitHub Copilot would be a viable addition to our software development toolkit.

To compare our findings with other studies, we included most of the key performance indicators (KPIs) tracked in the previous ones. Therefore, we decided to adopt the SPACE framework (Forsgren, 2021 [13]), which focuses on various aspects of developer productivity, ranging from overall individual satisfaction to knowledge sharing among different individuals. A summary of these questions can be found at the following url.

3.1. Participants Selection

For this study, we selected 31 participants from three specialized branches within our company, in particular:

• Branch A, a development team of a longstanding software solution, working on both new features and the maintenance of pre-existing ones.
• Branch B, focused mostly on the development of a new software product.
• Branch C, the development team of a software cloud product, engaged with both development of new features and maintenance.

These participants were selected due to their involvement in a broad range of projects, encompassing both innovative and established (legacy) projects. To promote an unbiased evaluation, we refrained from assigning predetermined tasks, allowing participants to incorporate the tool into their regular workflow. Over a two-month observation period, we monitored their usage of GitHub Copilot, aiming to capture its utility across diverse project types and user experiences. Notably, all of the selected participants had no less than 1 years of programming experience.

In order to gather participant feedback, we organized a series of in-person meetings, offering a forum for them to share their experiences with GitHub Copilot. Based on the insights gained during these discussions, we crafted a 16-question survey covering the SPACE framework dimensions (available at the following url). This survey mixed questions from existing research with new ones specifically designed for this study, including both closed and open-ended questions. The closed-ended questions aimed to collect quantitative data, while the open-ended ones sought to capture more nuanced feedback on their experiences. This approach aimed to collect quantitative and qualitative data to comprehensively evaluate GitHub Copilot's performance.

4. Results & Discussion

The following section presents some of the key findings obtained from the analysis of participant responses. This section will be divided into three parts:

• overall ratings: Participants were asked to rate GitHub Copilot on a scale from 1 to 10, with 10 being the highest score. This rating serves as an overall assessment of GitHub Copilot's performance and effectiveness.
• main benefits: This section highlights the areas where GitHub Copilot excels. It examines the specific aspects or functionalities of the tool that participants found most valuable or beneficial in their programming tasks.
• main drawbacks: In this part we explore the challenges or limitations experienced by participants when using GitHub Copilot. It focuses on the areas where the tool may struggle or encounter difficulties, as reported by the participants.

In the following sections, we will delve deeper into these areas to comprehensively analyse the findings.

4.1. Overall feedback

The evaluation of GitHub Copilot's efficacy, as illustrated in Figure 1, reveals an overall positive reception, with a computed mean rating of 7.4 on a scale where higher values denote greater approval. This overarching assessment, however, masks underlying variations in user satisfaction that are closely linked to the specific programming language in use by the developers.

Figure 1: Overall rating given to GitHub Copilot, by programming language

A more granular analysis of the data shows that developers employing high-level programming languages, notably Python and Java, tend to assign higher ratings to GitHub Copilot. This distinction suggests a potential correlation between the nature of the programming language and the tool's performance. High-level languages, characterized by their abstraction from machine languages and emphasis on readability, may inherently facilitate more accurate and contextually relevant code suggestions by AI-based tools like GitHub Copilot. Additionally, the extensive availability of open-source code in these languages may provide a richer dataset for the AI's learning algorithms, enhancing its predictive accuracy and relevance. Conversely, the analysis reveals a modest decline in satisfaction among C# developers, with a mean score of 7. This discrepancy hints at possible limitations in GitHub Copilot's adaptability or efficiency across different programming environments. The factors contributing to this variation could range from the structural and syntactical idiosyncrasies of C# that challenge the AI's prediction models, to a potentially lesser volume of training data derived from C# codebases.

These insights advocate for a more nuanced approach to the continuous development and refinement of GitHub Copilot, emphasizing the need for language-specific optimizations to cater to the diverse requirements of the development community. For users, the findings highlight the importance of aligning expectations with the capabilities and limitations of AI tools within specific programming contexts.

4.2. Main benefits

The study's findings, as visualized in Figure 2, delineate the multifaceted benefits that GitHub Copilot offers to developers, underscoring its impact on productivity and code quality.

One of the principal advantages identified by participants is Copilot's proficiency in auto-generating boilerplate code and foundational code structures. This feature significantly reduces the time and effort required during the initial phases of project setup, allowing developers to bypass the tedium of crafting repetitive code patterns from scratch. Such efficiency in establishing project infrastructure is not only a time-saver but also enables a smoother transition to more complex development tasks.

Moreover, GitHub Copilot's contribution to code documentation represents another vital benefit. The tool's ability to furnish quick and precise descriptions for functions, classes, and various code segments assists developers in maintaining well-documented codebases. Proper documentation is crucial for enhancing code readability, facilitating easier maintenance, and enabling smoother collaboration among team members. By automating this aspect, Copilot aids in ensuring that projects adhere to best practices in code documentation, thus elevating the overall quality of the development process.

The generation of test code for existing functions by GitHub Copilot is highlighted as a particularly advantageous feature. This capability assists developers in creating comprehensive test suites, a critical component of the software development life cycle aimed at verifying the correctness and reliability of code. Notably, we advised the developers to exercise increased caution when incorporating tests authored by GitHub Copilot, given their significant influence on the code's overall reliability.
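To make the caution about generated tests concrete, the following minimal Python sketch shows the kind of test suite a developer might be offered for an existing function and would then review by hand. The function `apply_discount` and all test cases are hypothetical and were written for illustration; they are neither output captured from GitHub Copilot nor code from the study.

```python
import unittest


def apply_discount(price: float, percent: float) -> float:
    """Return price reduced by percent (0-100). Hypothetical example function."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTest(unittest.TestCase):
    """Tests of the kind a reviewer should read line by line before merging."""

    def test_typical_discount(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_zero_and_full_discount(self):
        # Boundary values: no discount and a 100% discount.
        self.assertEqual(apply_discount(80.0, 0), 80.0)
        self.assertEqual(apply_discount(80.0, 100), 0.0)

    def test_out_of_range_percent_is_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(80.0, 120)
```

The suite can be run with `python -m unittest` from the file's directory. When reviewing a suggested suite of this kind, the boundary and error cases arguably deserve the closest look, since a test that encodes a wrong expectation passes silently and weakens the suite.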

An interesting observation from the study is the strategic utilization of time saved through GitHub Copilot's assistance. Many participants reported reallocating the time gained to enhance the quality of their products further by focusing on rigorous testing, refining documentation, or dedicating effort to areas of the project that could benefit from manual oversight. Alternatively, some participants chose to invest the saved time into personal development, such as exploring new programming libraries, learning new tools, or contributing to other projects. This flexibility underscores Copilot's role not just as a tool for immediate productivity gains but also as an enabler for broader professional growth and product quality enhancement.

4.3. Main drawbacks

While GitHub Copilot has been lauded for its ability to enhance developer productivity and streamline workflows, the tool is not without its limitations, which can impact its overall effectiveness in certain contexts.

One notable concern is its integration with Integrated Development Environments (IDEs), particularly when used alongside other coding aids such as IntelliSense. Some users have reported conflicts between GitHub Copilot and IntelliSense, leading to potential confusion and errors. This issue underscores the importance of seamless tool integration within the development environment to prevent disruption in the coding process.

Another drawback observed by users is the tool's tendency to suggest repetitive code. Such suggestions can potentially lead to less efficient or elegant coding solutions, contradicting the tool's aim to streamline development efforts. This behavior might stem from the AI's training data or its current understanding of best coding practices, indicating an area for further refinement to ensure that Copilot consistently proposes high-quality and contextually appropriate code.

The effectiveness of GitHub Copilot appears to vary significantly when dealing with different types of codebases. Specifically, its performance with large or legacy codebases presents challenges, as evidenced by a reported median contribution of merely 10% (Figure 3) to the lines of code written in such contexts. This reduced effectiveness could be attributed to the AI's limited ability to fully comprehend the complexities and nuances of older or more extensive codebases, leading to challenges in generating accurate and useful code suggestions.

Conversely, GitHub Copilot demonstrates considerably greater efficiency with new codebases, where it contributes to around 30% of the written code. This discrepancy highlights Copilot's aptitude for aiding in the rapid development of new projects, where its capabilities in generating boilerplate code and structuring new projects can be most beneficial.
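The comparison between legacy and new codebases rests on a per-group median of self-reported shares. A minimal Python sketch of that kind of computation follows. The record layout, the function name, and the response values are hypothetical placeholders chosen for illustration; they are not the study's data and do not reproduce its results.

```python
from collections import defaultdict
from statistics import median

# Hypothetical survey records: (codebase type, self-reported percentage of
# lines written with the tool's help). Placeholder values, not study data.
responses = [
    ("legacy", 5), ("legacy", 15), ("legacy", 20),
    ("new", 20), ("new", 40), ("new", 60),
]


def median_share_by_codebase(records):
    """Group shares by codebase type and return the median of each group."""
    groups = defaultdict(list)
    for codebase_type, share in records:
        groups[codebase_type].append(share)
    return {kind: median(shares) for kind, shares in groups.items()}


print(median_share_by_codebase(responses))
```

A median is a reasonable summary for this kind of question because a few unusually high or low self-estimates would otherwise dominate a small sample.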

To further understand the impact of GitHub Copilot, we incorporated a key performance indicator (KPI) from a GitHub Copilot survey [12] for a direct comparison with our study's outcomes. Our findings, reported in Table 1, showed notable differences from GitHub's reported results. Although our results still reflect a highly positive sentiment, it is important to note that the differences may be attributed to the nature of the experiments conducted by GitHub. Specifically, our study population consisted of individuals with impending deadlines, which could influence their perceptions and experiences with the tool.

Table 1: GitHub study performance metrics

Question                          GitHub overall  Prometeia overall  Branch A  Branch B  Branch C
Focus on more satisfying work     74%             35%                36%       44%       27%
Feel more productive              88%             32%                18%       44%       36%
Are faster with repetitive tasks  96%             74%                82%       78%       63%

The variations in percentages observed between the groups can be ascribed to various factors, such as the programming languages employed by developers and the nature of the projects they were engaged in. For example, diverse branches may adopt distinct programming practices, preferences, and project requirements, which can shape their views and usage of GitHub Copilot. Additionally, the nature of the projects (whether new or legacy) can also impact how the benefits and drawbacks of GitHub Copilot are perceived. These factors highlight the significance of considering the context in which the tool is employed and comprehending its potential influence on the recorded percentages.

5. Related works

The assessment of Artificial Intelligence (AI) tools' impact on various sectors, particularly in software development, has been an area of interest in recent years. The advent of AI innovations has been consistently associated with enhancements in productivity levels and the facilitation of a more intuitive process for coding, as documented in the findings of Chen et al. (2021) [14], who illustrated the positive ramifications of AI on software engineering practices.

In the realm of code generation, the deployment of deep learning methodologies, especially those utilizing Transformer models, has been met with considerable success. A noteworthy illustration of this is the study by Feng et al. (2020) [15], which presents a model that significantly surpasses the efficiency of conventional code completion tools. This approach, which harnesses the power of deep learning to achieve contextual comprehension and prediction, showcases the potential of generative AI to navigate and replicate complex coding idioms and patterns with remarkable accuracy.

Furthermore, research conducted by Yetistiren et al. (2022) [16] points to the proficiency of GitHub Copilot in understanding coding syntax. This underlines the broad spectrum of advantages offered by AI in the realm of software development, extending beyond mere procedural improvements to include significant qualitative enhancements in code management and optimization.

Despite the proliferation of studies and reviews in this area, it is crucial to acknowledge the predominance of in vitro research methodologies (where specific programming tasks are assigned to participants) and the occasional emergence of conflicting findings, as highlighted by Vaithilingam (2022) [11] and the study on GitHub Copilot (2021) [12]. These discrepancies underscore the necessity for our own comprehensive in vivo study (in which no specific programming tasks are prescribed), involving a diverse array of developers from various corporate sectors, employing different Integrated Development Environments (IDEs) and programming languages.

6. Conclusions

The study revealed a generally positive overall rating for GitHub Copilot, with developers giving it an average score of 7.4 (out of 10); as expected, ratings varied based on the programming language used. Our developers identified several benefits of using GitHub Copilot.

One of its main advantages is the ability to generate boilerplate and basic code structures, saving developers time and effort during project setup. Additionally, the tool supports proper code documentation by providing accurate descriptions of functions, classes, and other code elements. Another notable benefit is its capability to generate test code for existing functions, contributing to code reliability. Consistent with its original goal, this study proved GitHub Copilot effective in supporting our software development activities. As a result, various branches of our company started using the tool as part of their standard development toolkit. Lastly, we could not evaluate the tool on junior programmers, which leaves an area of inquiry for future studies. Understanding how newcomers to the field, with potentially different learning curves and development practices, interact with and benefit from GitHub Copilot could provide valuable insights into its overall utility and areas for improvement.

Acknowledgments

The authors wish to express their sincere gratitude to Dave Burnison from GitHub for his feedback on the Copilot service. His insights and expertise have significantly contributed to the research and understanding of the impact of AI-assisted programming tools in software development.

References

[1] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-resolution image synthesis with latent diffusion models, 2022. arXiv:2112.10752.
[2] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, 2020. arXiv:2005.14165.
[3] J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al., Improving image generation with better captions, Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf 2 (2023) 8.
[4] Midjourney, https://www.midjourney.com/home, 2024. Accessed: March 25, 2024.
[5] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, et al., Scaling rectified flow transformers for high-resolution image synthesis, arXiv preprint arXiv:2403.03206 (2024).
[6] ChatGPT, https://openai.com/blog/chatgpt, 2024. Accessed: April 4, 2024.
[7] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, Llama: Open and efficient foundation language models, 2023. arXiv:2302.13971.
[8] Gemini-Team, Gemini: A family of highly capable multimodal models, 2024. arXiv:2312.11805.
[9] Claude, https://www.anthropic.com/claude, 2024. Accessed: April 4, 2024.
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2023. arXiv:1706.03762.
[11] P. Vaithilingam, T. Zhang, E. L. Glassman, Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models, CHI EA '22, Association for Computing Machinery, New York, NY, USA, 2022. URL: https://doi.org/10.1145/3491101.3519665. doi:10.1145/3491101.3519665.
[12] Research: Quantifying GitHub Copilot's impact on developer productivity and happiness, 2022. Accessed: 2024-03-13. GitHub Copilot Study.
[13] N. Forsgren, M.-A. Storey, C. Maddila, T. Zimmermann, B. Houck, J. Butler, The SPACE of developer productivity: There's more to it than you think., Queue 19 (2021) 20–48. doi:10.1145/3454122.3454124.
[14] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, W. Zaremba, Evaluating large language models trained on code, 2021. arXiv:2107.03374.
[15] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, M. Zhou, CodeBERT: A pre-trained model for programming and natural languages, 2020. arXiv:2002.08155.
[16] B. Yetistiren, I. Ozsoy, E. Tuzun, Assessing the quality of GitHub Copilot's code generation, in: Proceedings of the 18th international conference on predictive models and data analytics in software engineering, 2022, pp. 62–71.