GitHub Copilot: a systematic study

Alessandro Benetti1,∗, Michele Filannino1

1 Prometeia, Piazza Trento e Trieste, 30 - Bologna, 40137, Italy

Abstract
This paper examines the effects of GitHub Copilot, a prominent example of generative artificial intelligence (GAI), on software development methodologies. Through an empirical study of GitHub Copilot's performance in a professional setting, we assess its value across various programming environments. Our evaluation shows that GitHub Copilot improves developer productivity and provides useful assistance in different coding scenarios. Furthermore, the research outlines effective strategies for leveraging GitHub Copilot to its fullest potential, thus advancing the use of GAI tools in software engineering. While recognizing GitHub Copilot's considerable advantages, we also identify its shortcomings and areas in need of further improvement.

Keywords
Generative AI, Software Engineering, GitHub, Software Development, Systematic Study, GAI, Coding Assistance

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
∗ Corresponding author.
alessandro.benetti@prometeia.com (A. Benetti); michele.filannino@prometeia.com (M. Filannino)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

The advent of Generative Artificial Intelligence (GAI) is transforming our approach to creativity and the production of new content. GAI encompasses machine learning algorithms capable of generating content, ranging from images, videos, and text to music, that mirrors the style and quality of human-created works.

Recent breakthroughs in deep learning have given rise to sophisticated GAI models, such as latent diffusion models [1] and Generative Pre-trained Transformers (GPT) [2]. These models, capable of producing realistic and varied content with minimal human oversight, are trained on extensive datasets and generate new items by sampling from a learned probability distribution.

GAI's potential is vast, with applications including the creation of lifelike virtual imagery (e.g., DALL-E 3 [3], Midjourney [4], Stable Diffusion [5]) and efficient writing assistants or conversational agents (e.g., ChatGPT [6], LLaMA [7], Gemini [8], Claude [9]). However, the rapid adoption of GAI technologies necessitates careful consideration of their ethical and responsible use, particularly in light of significant ethical and legal challenges such as intellectual property rights, privacy issues, and the potential for misuse of GAI-generated content.

2. GitHub Copilot

GitHub Copilot distinguishes itself as an innovative application of GAI, offering substantial assistance to developers in coding tasks. It is based on a GPT-3.5 model, which is part of the advanced Generative Pre-trained Transformer series. This model leverages the Transformer architecture [10], renowned for its efficacy in processing and generating text that closely resembles human writing. GPT-3.5's capabilities extend to a comprehensive understanding of language subtleties, contextual nuances, and, notably, programming code syntax.

GitHub Copilot's primary objective is to enable programmers to concentrate on problem-solving instead of searching for the appropriate libraries and functions to implement their desired solution. With this tool, programmers can become more productive and efficient and deliver high-quality code in a fraction of the time it typically takes.

The primary capabilities of GitHub Copilot can be summarized as follows:

• Natural language interface: Developers can interact with GitHub Copilot using natural language commands. This means they can describe what they want to achieve in plain English, and GitHub Copilot will suggest code to accomplish the task.
• Integrated with IDEs: GitHub Copilot is integrated with popular code editors and IDEs, including Visual Studio Code, Visual Studio, and PyCharm.
• Context-aware: GitHub Copilot analyzes the context of the code being written and generates suggestions accordingly.
• Privacy-focused: GitHub Copilot for Business does not retain telemetry or code-snippet data.

While GitHub Copilot can be a powerful tool for developers, it is important to underline some potential concerns, most of which are common to Large Language Models in general:

• Potentially inaccurate code: GitHub Copilot may generate incorrect or non-functional code.
• Limited world or codebase knowledge after the training date: this might cause the suggestion of deprecated methods for libraries that change significantly over time.
• No match with the information that the programmer has: this is true both for the overall context of the code that it is suggesting and for some intrinsic knowledge about the world that the programmer has, like awareness, among other things.

To address these concerns, it is important for developers to carefully review and test the code generated by GitHub Copilot.

3. Methodology

With the advent of this groundbreaking technology, it is crucial to thoroughly evaluate its potential through extensive testing. At Prometeia, a software development-focused consulting firm, we decided to embark on a pilot study aimed explicitly at evaluating the functionalities of GitHub Copilot.

We chose GitHub Copilot Business over alternatives like Tabnine, Blackbox, and Sourcery due to its wide range of supported programming languages, compatibility with various Integrated Development Environments (IDEs), and advanced features that meet enterprise standards, including scalability, security, and compliance.

Various reviews and studies have already examined the tool, including those by Vaithilingam et al. [11] and the GitHub Copilot study [12]. However, these investigations have occasionally reported contradictory findings and have not specifically concentrated on the use of this tool within a real-world corporate environment.

Undertaking a pilot study offers several advantages, making it a strategic approach for our evaluation process. Firstly, the pilot allows our developers to assess the tool's effectiveness by testing it on a small scale. This provides an opportunity to gauge how well GitHub Copilot can assist in achieving their objectives, determine whether the generated code meets their requirements, and assess whether it improves their current development process. Secondly, a pilot study can help us identify any limitations or potential issues with GitHub Copilot, such as difficulties with specific programming languages or complex coding tasks. By identifying such limitations early on, we can avoid potential problems and find alternatives or workarounds, thus saving time and money in the long run. Hence, this initiative aimed to determine whether GitHub Copilot would be a viable addition to our software development toolkit.

To compare our findings with other studies, we included most of the key performance indicators (KPIs) tracked in the previous ones. We also adopted the SPACE framework (Forsgren et al., 2021 [13]), which covers several aspects of developer productivity, ranging from overall individual satisfaction to knowledge sharing among different individuals. A summary of these questions can be found at the following URL.

3.1. Participants Selection

For this study, we selected 31 participants from three specialized branches within our company, in particular:

• Branch A, a development team of a long-standing software solution, working on both new features and the maintenance of pre-existing ones.
• Branch B, focused mostly on the development of a new software product.
• Branch C, the development team of a software cloud product, engaged with both development of new features and maintenance.

These participants were selected because they are involved in a broad range of projects, encompassing both innovative and established (legacy) ones. To promote an unbiased evaluation, we refrained from assigning predetermined tasks, allowing participants to incorporate the tool into their regular workflow. Over a two-month observation period, we monitored their usage of GitHub Copilot, aiming to capture its utility across diverse project types and user experiences. Notably, all of the selected participants had at least 1 year of programming experience.

To gather participant feedback, we organized a series of in-person meetings, offering a forum for them to share their experiences with GitHub Copilot. Based on the insights gained during these discussions, we crafted a 16-question survey covering the SPACE framework dimensions (available at the following URL). The survey mixed questions from existing research with new ones specifically designed for this study, and included both closed and open-ended questions. The closed-ended questions collected quantitative data, while the open-ended ones captured more nuanced feedback on participants' experiences. Together, the quantitative and qualitative data support a comprehensive evaluation of GitHub Copilot's performance.

4. Results & Discussion

This section presents some of the key findings obtained from the analysis of participant responses. It is divided into three parts:

• overall ratings: Participants were asked to rate GitHub Copilot on a scale from 1 to 10, with 10 being the highest score. This rating serves as an overall assessment of GitHub Copilot's performance and effectiveness.
• main benefits: This part highlights the areas where GitHub Copilot excels. It examines the specific aspects or functionalities of the tool that participants found most valuable or beneficial in their programming tasks.
• main drawbacks: This part explores the challenges or limitations experienced by participants when using GitHub Copilot. It focuses on the areas where the tool may struggle or encounter difficulties, as reported by the participants.

The following subsections analyse each of these areas in more detail.

4.1. Overall feedback

The evaluation of GitHub Copilot's efficacy, as illustrated in Figure 1, reveals an overall positive reception, with a mean rating of 7.4 on the 1-to-10 scale. This overall figure, however, masks variations in user satisfaction that are closely linked to the specific programming language used by the developers.

Figure 1: Overall rating given to GitHub Copilot, by programming language

A more granular analysis of the data shows that developers employing high-level programming languages, notably Python and Java, tend to assign higher ratings to GitHub Copilot. This distinction suggests a potential correlation between the nature of the programming language and the tool's performance. High-level languages, characterized by their abstraction from machine languages and emphasis on readability, may inherently facilitate more accurate and contextually relevant code suggestions by AI-based tools like GitHub Copilot. Additionally, the extensive availability of open-source code in these languages may provide a richer dataset for the AI's learning algorithms, enhancing its predictive accuracy and relevance. Conversely, the analysis reveals a modest decline in satisfaction among C# developers, with a mean score of 7. This discrepancy hints at possible limitations in GitHub Copilot's adaptability or efficiency across different programming environments. The factors contributing to this variation could range from the structural and syntactical idiosyncrasies of C# that challenge the AI's prediction models, to a potentially smaller volume of training data derived from C# codebases.

These insights advocate for a more nuanced approach to the continuous development and refinement of GitHub Copilot, emphasizing the need for language-specific optimizations to cater to the diverse requirements of the development community. For users, the findings highlight the importance of aligning expectations with the capabilities and limitations of AI tools within specific programming contexts.

4.2. Main benefits

The study's findings, as visualized in Figure 2, show the range of benefits that GitHub Copilot offers to developers, underscoring its impact on productivity and code quality.

Figure 2: GitHub Copilot usefulness on different tasks

One of the principal advantages identified by participants is Copilot's proficiency in auto-generating boilerplate code and foundational code structures. This feature significantly reduces the time and effort required during the initial phases of project setup, allowing developers to bypass the tedium of crafting repetitive code patterns from scratch. Such efficiency in establishing project infrastructure is not only a time-saver but also enables a smoother transition to more complex development tasks.

Moreover, GitHub Copilot's contribution to code documentation represents another vital benefit. The tool's ability to furnish quick and precise descriptions for functions, classes, and various code segments assists developers in maintaining well-documented codebases. Proper documentation is crucial for enhancing code readability, facilitating easier maintenance, and enabling smoother collaboration among team members. By automating this aspect, Copilot helps projects adhere to best practices in code documentation, thus elevating the overall quality of the development process.

The generation of test code for existing functions by GitHub Copilot is highlighted as a particularly advantageous feature. This capability assists developers in creating comprehensive test suites, a critical component of the software development life cycle aimed at verifying the correctness and reliability of code. Notably, we advised the developers to exercise increased caution when incorporating tests authored by GitHub Copilot, given their significant influence on the code's overall reliability.

An interesting observation from the study is how participants used the time saved through GitHub Copilot's assistance. Many participants reported reallocating the time gained to further enhance the quality of their products by focusing on rigorous testing, refining documentation, or dedicating effort to areas of the project that could benefit from manual oversight. Alternatively, some participants chose to invest the saved time into personal development, such as exploring new programming libraries, learning new tools, or contributing to other projects.
This flexibility underscores Copilot's role not just as a tool for immediate productivity gains but also as an enabler for broader professional growth and product quality enhancement.

4.3. Main drawbacks

While GitHub Copilot has been lauded for its ability to enhance developer productivity and streamline workflows, the tool is not without its limitations, which can impact its overall effectiveness in certain contexts.

One notable concern is its integration with Integrated Development Environments (IDEs), particularly when used alongside other coding aids such as IntelliSense. Some users have reported conflicts between GitHub Copilot and IntelliSense, leading to potential confusion and errors. This issue underscores the importance of seamless tool integration within the development environment to prevent disruption in the coding process.

Another drawback observed by users is the tool's tendency to suggest repetitive code. Such suggestions can potentially lead to less efficient or elegant coding solutions, contradicting the tool's aim to streamline development efforts. This behavior might stem from the AI's training data or its current understanding of best coding practices, indicating an area for further refinement to ensure that Copilot consistently proposes high-quality and contextually appropriate code.

The effectiveness of GitHub Copilot appears to vary significantly when dealing with different types of codebases. Specifically, its performance with large or legacy codebases presents challenges, as evidenced by a reported median contribution of merely 10% (Figure 3) to the lines of code written in such contexts. This reduced effectiveness could be attributed to the AI's limited ability to fully comprehend the complexities and nuances of older or more extensive codebases, leading to challenges in generating accurate and useful code suggestions.

Figure 3: Percentage of code lines written by GitHub Copilot

Conversely, GitHub Copilot demonstrates considerably greater efficiency with new codebases, where it contributes to around 30% of the written code. This discrepancy highlights Copilot's aptitude for aiding in the rapid development of new projects, where its capabilities in generating boilerplate code and structuring new projects can be most beneficial.

5. Related works

The assessment of the impact of Artificial Intelligence (AI) tools on various sectors, particularly software development, has been an area of interest in recent years. AI innovations have been consistently associated with enhancements in productivity levels and with a more intuitive coding process, as documented in the findings of Chen et al. (2021) [14], who illustrated the positive ramifications of AI on software engineering practices.

In the realm of code generation, the deployment of deep learning methodologies, especially those utilizing Transformer models, has been met with considerable success. A noteworthy illustration of this is the study by Feng et al. (2020) [15], which presents a model that significantly surpasses the efficiency of conventional code completion tools. This approach, which harnesses the power of deep learning to achieve contextual comprehension and prediction, showcases the potential of generative AI to navigate and replicate complex coding idioms and patterns with remarkable accuracy.

Furthermore, research conducted by Yetistiren et al. (2022) [16] points to the proficiency of GitHub Copilot in understanding coding syntax. This underlines the broad spectrum of advantages offered by AI in the realm of software development, extending beyond mere procedural improvements to include significant qualitative enhancements in code management and optimization.

Despite the proliferation of studies and reviews in this area, it is crucial to acknowledge the predominance of in vitro research methodologies (where specific programming tasks are assigned to participants) and the occasional emergence of conflicting findings, as highlighted by Vaithilingam et al. (2022) [11] and the study on GitHub Copilot (2022) [12]. These discrepancies underscore the necessity for our own comprehensive in vivo study (in which no specific programming tasks are prescribed), involving a diverse array of developers from various corporate sectors, employing different Integrated Development Environments (IDEs) and programming languages.

5.1. Comparison with other studies

To further understand the impact of GitHub Copilot, we incorporated key performance indicators (KPIs) from a GitHub Copilot survey [12] for a direct comparison with our study's outcomes. Our findings, reported in Table 1, showed notable differences from GitHub's reported results. Although our results still reflect a highly positive sentiment, it is important to note that the differences may be attributed to the nature of the experiments conducted by GitHub. Specifically, our study population consisted of individuals with impending deadlines, which could influence their perceptions and experiences with the tool.

Table 1: GitHub study performance metrics

Question                          GitHub Overall  Prometeia Overall  Branch A  Branch B  Branch C
Focus on more satisfying work     74%             35%                36%       44%       27%
Feel more productive              88%             32%                18%       44%       36%
Are faster with repetitive tasks  96%             74%                82%       78%       63%

The variations in percentages observed between the groups can be ascribed to various factors, such as the programming languages employed by developers and the nature of the projects they were engaged in. For example, different branches may adopt distinct programming practices, preferences, and project requirements, which can shape their views and usage of GitHub Copilot. Additionally, the nature of the projects (whether new or legacy) can also affect how the benefits and drawbacks of GitHub Copilot are perceived. These factors highlight the importance of considering the context in which the tool is employed when interpreting the recorded percentages.

6. Conclusions

The study revealed a generally positive overall rating for GitHub Copilot, with developers giving it an average score of 7.4 (out of 10); as expected, ratings varied based on the programming language used. Our developers identified several benefits of using GitHub Copilot. One of its main advantages is the ability to generate boilerplate and basic code structures, saving developers time and effort during project setup. Additionally, the tool supports proper code documentation by providing accurate descriptions of functions, classes, and other code elements. Another notable benefit is its capability to generate test code for existing functions, contributing to code reliability.

Consistent with its original goal, this study found GitHub Copilot effective in supporting our software development activities. As a result, various branches of our company started using the tool as part of their standard development toolkit. Lastly, we could not evaluate the tool on junior programmers, which leaves an area of inquiry for future studies. Understanding how newcomers to the field, with potentially different learning curves and development practices, interact with and benefit from GitHub Copilot could provide valuable insights into its overall utility and areas for improvement.

Acknowledgments

The authors wish to express their sincere gratitude to Dave Burnison from GitHub for his feedback on the Copilot service. His insights and expertise have significantly contributed to the research and understanding of the impact of AI-assisted programming tools in software development.

References

[1] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-resolution image synthesis with latent diffusion models, 2022. arXiv:2112.10752.
[2] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, 2020. arXiv:2005.14165.
[3] J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al., Improving image generation with better captions, Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf 2 (2023) 8.
[4] Midjourney, https://www.midjourney.com/home, 2024. Accessed: March 25, 2024.
[5] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, et al., Scaling rectified flow transformers for high-resolution image synthesis, arXiv preprint arXiv:2403.03206 (2024).
[6] ChatGPT, https://openai.com/blog/chatgpt, 2024. Accessed: April 4, 2024.
[7] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, Llama: Open and efficient foundation language models, 2023. arXiv:2302.13971.
[8] Gemini-Team, Gemini: A family of highly capable multimodal models, 2024. arXiv:2312.11805.
[9] Claude, https://www.anthropic.com/claude, 2024. Accessed: April 4, 2024.
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2023. arXiv:1706.03762.
[11] P. Vaithilingam, T. Zhang, E. L. Glassman, Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models, CHI EA '22, Association for Computing Machinery, New York, NY, USA, 2022. URL: https://doi.org/10.1145/3491101.3519665. doi:10.1145/3491101.3519665.
[12] Research: Quantifying GitHub Copilot's impact on developer productivity and happiness, 2022. Accessed: 2024-03-13. GitHub Copilot Study.
[13] N. Forsgren, M.-A. Storey, C. Maddila, T. Zimmermann, B. Houck, J. Butler, The SPACE of developer productivity: There's more to it than you think, Queue 19 (2021) 20–48. doi:10.1145/3454122.3454124.
[14] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, W. Zaremba, Evaluating large language models trained on code, 2021. arXiv:2107.03374.
[15] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, M. Zhou, CodeBERT: A pre-trained model for programming and natural languages, 2020. arXiv:2002.08155.
[16] B. Yetistiren, I. Ozsoy, E. Tuzun, Assessing the quality of GitHub Copilot's code generation, in: Proceedings of the 18th international conference on predictive models and data analytics in software engineering, 2022, pp. 62–71.
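
As a worked illustration of the comparison in Section 5.1, the following minimal Python sketch computes two quantities from the percentages in Table 1: the gap, in percentage points, between GitHub's reported figure and the Prometeia overall figure, and the spread between the highest and lowest branch for each question. The percentages are copied from Table 1; the names TABLE_1, gap_to_github, and branch_spread are hypothetical helpers introduced here for illustration and are not part of the study's own analysis.

```python
# Percentages copied from Table 1 (share of respondents agreeing with each statement).
# The dictionary layout and helper names are illustrative assumptions, not the authors' code.
TABLE_1 = {
    "Focus on more satisfying work": {
        "GitHub": 74, "Prometeia": 35, "Branch A": 36, "Branch B": 44, "Branch C": 27,
    },
    "Feel more productive": {
        "GitHub": 88, "Prometeia": 32, "Branch A": 18, "Branch B": 44, "Branch C": 36,
    },
    "Are faster with repetitive tasks": {
        "GitHub": 96, "Prometeia": 74, "Branch A": 82, "Branch B": 78, "Branch C": 63,
    },
}

BRANCHES = ("Branch A", "Branch B", "Branch C")


def gap_to_github(table):
    """Percentage-point gap between GitHub's figure and the Prometeia overall figure."""
    return {question: row["GitHub"] - row["Prometeia"] for question, row in table.items()}


def branch_spread(table):
    """Difference between the highest and lowest branch percentage, per question."""
    return {
        question: max(row[b] for b in BRANCHES) - min(row[b] for b in BRANCHES)
        for question, row in table.items()
    }


if __name__ == "__main__":
    gaps = gap_to_github(TABLE_1)
    spreads = branch_spread(TABLE_1)
    for question in TABLE_1:
        print(f"{question}: gap to GitHub = {gaps[question]} points, "
              f"spread across branches = {spreads[question]} points")
```

Under these assumptions, the smallest gap to GitHub's figures is on repetitive tasks, while the question on feeling more productive shows both the largest gap and the largest spread between branches.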