Towards an Empirical Analysis of Code Cloning and Code Reuse in CI/CD Ecosystems Guillaume Cardoen1 1 Software Engineering Lab, University of Mons, Belgium Abstract Large open source projects are engaged in collaborative software development through social coding platforms, and use CI/CD practices to automate numerous repetitive tasks through workflows. Most CI/CD tools follow the configuration-as-code paradigm, specifying their workflow configurations as runnable workflow files. We posit that, just as is the case when maintaining regular source code, workflow configuration files are subject to the good and bad practices of reusability and cloning. This paper provides the plan of my doctoral research, explaining the objectives and research questions, and outlining the research method to reach these objectives. My research focuses on the empirical analysis of how code reuse and code cloning practices emerge and evolve in workflow files. The initial focus is on GitHub, taking GitHub Actions as a case study, given that it is by far the most popular CI/CD used in GitHub. Keywords configuration as code, empirical analysis, code cloning, code reuse, collaborative software development, GitHub 1. Introduction Contemporary collaborative software development relies on a plethora of tools that support or stream- line the development process such as version control systems, issue trackers, CI/CD and workflow automation tools, quality checkers, security scanners and many more. Such tools tend to be integrated into collaborative social coding platforms. GitHub [1] is the largest social coding platform today with 100M+ users and hosting 284M+ public projects in 2023 [2]. GitLab [3] is another very popular social coding platform with an estimation of over 30 millions registered users [4]. Continuous integration and delivery (CI/CD) is a pillar for collaborative software development [5], allowing to automate the numerous repetitive tasks being performed by development teams (such as license agreements, automatic answers, code reviewing, quality analysis, test coverage, automatic issue triaging, ...). Many CI/CD services have emerged throughout the years (e.g., Travis, Jenkins, CircleCI, AppVeyor), and other CI/CD services have been directly integrated into social coding platforms (e.g. GitLab CI/CD since September 2015 [3] and GitHub Actions since November 2019 [6]). Most of today’s CI/CD tools rely on configuration files (called workflows in GitHub Actions and pipelines in GitLab CI/CD) to specify automated jobs. These jobs can be executed on runners provided by the social coding platform (or through dedicated user-provided runners), based on triggers that are either manual or event-based (such as a pull request, commit, release, recurrent schedule, signal from an external service, and many more). Given that CI/CD configuration files can be “executed”, they can be seen as an instance of the so- called configuration-as-code practice [7]. Considering configuration files as code opens up an entire new spectrum of research opportunities, since it allows to empirically observe and study the same coding practices as those that have been studied for several decades for traditional programming languages. One of these is the study of good and bad code quality practices, and more in particular the study of code reuse and code cloning practices. Indeed, many of the reported “code smells” in the research literature on software quality are related to code duplication [8, 9]. Previous research has already demonstrated the existence of code duplication in Dockerfiles [10, 11], as well as higher maintenance BENEVOL24: The 23rd Belgium-Netherlands Software Evolution Workshop, November 21-22, Namur, Belgium $ Guillaume.CARDOEN@umons.ac.be (G. Cardoen)  0009-0005-2008-3565 (G. Cardoen) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings effort, prolonged fixes and inconsistent changes due to clones in build configuration files for Java systems [12]. I thus posit that cloning and reuse practices are also likely to be used in CI/CD configuration files. This makes it relevant to empirically study the prevalence and impact of such practices in project repositories hosted on social coding platforms like GitHub and GitLab. I hypothesise that project maintainers often tend to create new CI/CD configurations by copy-pasting existing configurations from elsewhere, either from other repositories they own or know about, or by relying on starter template configurations that are freely provided by the social coding platform. It remains an open question whether such copy-and-paste reuse is actually beneficial in the specific case of CI/CD, how duplicated code evolves, and whether it leads to maintainability problems in the long run. Building further on the preliminary findings of my master’s thesis [13], the goal of my PhD research is therefore to evaluate the code cloning and code reuse practices in CI/CD ecosystems, with an initial focus on GitHub Actions, and extending the scope of the study to other CI/CD ecosystems such as GitLab CI/CD in later phases. The remainder of this article is structured as follows. Section 2 presents the research context, focusing on related research on code reuse and code cloning, and presenting the GitHub Actions CI/CD ecosystem that will be used as the first case study. Section 3 presents the research goals, subdivided into nine research questions. Section 4 presents my current progress. 2. Context and Related Research 2.1. Code Reuse and Code Cloning According to Krueger [14], software reuse consists of building new software based on existing software artifacts (such as code fragments or design structures). Such reuse is a means to reduce development time and maintenance effort needed to build large and complex programs, as well as to enhance their general software quality posture [14]. Similarly, code reuse is the use of code snippets from existing systems or components developed especially for being reused [15]. Many studies relating to code reuse can be found in the literature [16]. Such code reuse can be typically classified into multiple strategies: copy-paste, clone-and-own and platform orientation [16]. Copy-paste or clone-and-own reuse can lead to similar or identical code fragments, commonly referred to as code clones [9]. They could arise accidentally, but are typically introduced due to code duplication, forking, merging of similar systems, automated code generation tools, or due to changes made by developers during maintenance [17, 18]. Code clones are a heavily studied field in software engineering [19, 20, 21]. Code clones can have beneficial effects such as easier understanding of the code, reduced development time, or ensuring robustness [22, 9, 18]. On the other hand, code clones are also considered as bad smells [8] as they tend to increase the code base size, may lead to bug propagation, have a negative impact on code readability, hide the originality of the code, increase maintenance cost or result in inconsistent and repetitive bug fixes [23, 9]. Due to both positive and negative effects of code clones, researchers propose that code clones should at least be detected, even in the absence of refactoring them [17]. Code clones are often categorized into four types [24]: Type I clones represents code fragments that are identical, ignoring differences in comments or whitespace; Type II clones extends Type I by allowing different identifiers, literals, methods names and types; Type III clones includes Type II by allowing added, removed or modified code lines; Type IV clones are so-called semantic clones that have the same functional goal and behaviour even if they may look different syntactically. Many general-purpose code clone detection tools exist [19, 17]. To name but a few: NiCad [25], CCFinder [26], CCFinderSW [27], SourcererCC [28], Deckard [29], and CCSharp [30]. They can be classified into multiple categories, including text-based, token-based, tree-based, graph-based, metric- based, and learning-based. Hybrid clone detection approaches may belong to multiple categories [19]. Some tools also offer automatic refactoring of code clones. Arcelli Fontana et al. [31] developed such a tool for Java project, using NiCad [25] as clone detector. Code fragment similarity and code fingerprinting are closely related fields to code clone detection. Similarity Preserving Hashing Functions (SPHF) are functions mapping two similar inputs to two similar outputs [32]. Fingerprinting approaches divide documents in multiple term sequences and derive one or multiple fingerprints from these sequences. Documents sharing one or more equivalent fingerprints are susceptible to contain reused code fragments [33]. Both SPHF and fingerprinting approaches have been proposed for code clone detection [34, 33, 32]. Many empirical studies on code clones have been carried out in the past. To name but a few, Pate et al. [35] presented a systematic literature review related to code clone evolution. Mondal et al. [36] analysed the propagation of bugs through code clones, concluding that 18.42% of code fragments where a bug fix occurred are due to propagated bugs. Estefó et al. [37] studied code duplication in Robot Operating System and found that almost half of the analysed launch configuration files contained at least one code clone. Oumaziz et al. [10] analysed duplicated code in Dockerfiles via an index-based duplicate detection, finding that new instructions are frequent and issues experienced by developers explains a majority of modifications. Oumaziz [11] studied code clones in API documentation and Dockerfiles, finding that a majority of instructions in Dockerfiles are duplicated. McIntosh et al. [12] collected and analysed 3,872 open source build system configuration files. Considering only Type I clones, they concluded that half of build logic lines are cloned at least once in Java build systems. However, they noticed some build systems with a limited number of clones, suggesting cloning is not a necessity in build system configuration files. They observed that inconsistent changes and prolonged fixes, known to be clone-related problems in general-purpose code [23], are also affecting build system files. 2.2. GitHub Actions GitHub is the most prominent social coding platform [38]. GitHub Actions, being the most popular CI/CD system used in GitHub projects [39], is arguably one of the most popular CI/CD ecosystem. This section presents its concepts and existing empirical research that has been conducted on this quite recent ecosystem. GitHub has started to provide support for CI/CD within GitHub itself since November 2019 through GitHub Actions, which quickly became the most frequently employed CI/CD solution for GitHub projects [39]. Listing 1 shows a workflow configuration file, given by GitHub,1 for building a Python package and releasing it on PyPI. The workflow is stored in a YAML file respecting a given syntax specified by the GitHub documentation.2 A workflow file has at least two main parts describing when it is triggered (on: key) and what should be performed (jobs: key). The on: key describes the list of events upon which the workflow should run. Lines 3-5 of Listing 1 declare that the workflow should run each time a new release is published. The jobs: key declares one or more jobs, each composed of one or more steps. A step can use (uses: key) a reusable component, called Action, or run (run: key) a sequence of shell commands. Actions will be described in Section 2.3.1. Line 14 in Listing 1 declares a call to the actions/checkout Action (version 4) to clone the reposi- tory, while line 16 calls version 3 of another Action actions/setup-python to install and configure Python in the environment. Lines 17-18 are declaring a variable python-version used by the Action in order to specify the python version to install. Lines 20-22 declare the execution of two shell commands, the first one installing pip and the second one installing the Python package build with pip. Line 24 launches the python build system. Finally, line 26 calls the Action pypa/gh-action-pypi-publish uploading the built Python package to PyPI. Lines 27-29 declare two parameters that will be passed to the Action. Prior research on GitHub Actions has focused on the adoption and usage of GitHub Actions. Kinsman et al. [40] studied how developers use GitHub Actions in open-source projects after its adoption. Golzadeh et al. [39] quantitatively studied the usage of different CI/CD tools used in GitHub and 1 https://github.com/actions/starter-workflows/blob/main/ci/python-publish.yml 2 https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions Listing 1: GitHub Actions workflow file for building a Python package 1 name: Upload Python Package 2 3 on: 4 release: 5 types: [published] 6 7 permissions: 8 contents: read 9 10 jobs: 11 deploy: 12 runs-on: ubuntu-latest 13 steps: 14 - uses: actions/checkout@v4 15 - name: Set up Python 16 uses: actions/setup-python@v3 17 with: 18 python-version: ’3.12’ 19 - name: Install dependencies 20 run: | 21 python -m pip install --upgrade pip 22 pip install build 23 - name: Build package 24 run: python -m build 25 - name: Publish package 26 uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29 27 with: 28 user: __token__ 29 password: ${{ secrets.PYPI_API_TOKEN }} Rostami Mazrae et al. [41] qualitatively analysed the usage, co-usage and migration among CI/CD tools, noticing a migration towards GitHub Actions. Valenzuela-Toledo and Bergel [42] studied the evolution of GitHub Actions workflow files. Mazrae et al. [43] studied the main changes in workflow files, concluding that files are subject to frequent modifications. Decan et al. [6] analysed the reuse of Actions and automation practices for GitHub Actions workflows, concluding nearly all jobs make use of them. Saroar and Nayebi [44] qualitatively surveyed Actions developers in the goal of understanding how they use, create and search for Actions and identifying possible challenges they face while doing so. Khatami et al. [45] analysed 10,012 commits from the 83 most populars repositories using GitHub Actions looking for different kind of smells. They proposed an automatic smell detectors for each identified smell, and gave insights on how developers reacted to those smells, concluding 7 out of 22 smells are revelant to developers whereas 9 out of 22 had mixed reactions. 2.3. GitHub Actions Code Reusability Mechanisms The integrated CI/CD mechanisms offered by GitHub provide a variety of methods for facilitating code reuse in workflows. This section describes the principal means of code reuse available in GitHub Actions. 2.3.1. Reusable Actions GitHub Actions supports different type of reusable components directly referenceable inside its work- flows. In the GitHub Actions ecosystem, reusable components are called Actions. An Action represents an individual task, combinable with others to create a job. They can either consist of a single Javascript Listing 2: GitHub Action workflow file for building a Python package via a reusable workflow 1 # previous keys (on, permissions, etc.) 2 3 jobs: 4 deploy: 5 uses: ./.github/workflows/pypi-reusable.yaml 6 with: 7 python-version: 3.12 8 username: __token__ 9 secrets: 10 password: ${{ secrets.PYPI_API_TOKEN }} file being executed, be composed themselves of Actions or shell commands, or refer to a Docker con- tainer. In the latter case, the container is then used as environment for the job, thus allowing specific version and dependency management. The objective of Actions is thus to replace frequent sequences of commands following the principle of "don’t repeat youself" (DRY). The GitHub Marketplace3 enables maintainers to search for and share Actions. Alternatively, an Action can be referenced by the repository hosting its code. 2.3.2. Reusable Workflows Reusable workflows4 represent another possibility for code reuse. As its name suggests, reusable workflows enables the maintainer to declare a workflow reusable by others. Such a reusable workflow is declared in a very similar manner as done in Listing 1. Once declared, this workflow can be referenced elsewhere, possibly in workflow files located in different repositories. Unlike Actions, a reusable workflow can contain multiple jobs, each composed of multiple steps. By using this reusability mechanism, the workflow presented at the Listing 1 can be transformed to reference a reusable workflow. The resulting workflow, shown at the Listing 2, is a shorter workflow file, where the steps of the deploy job are replaced by a reference to a reusable workflow (located in the same repository in our example), which describes these steps. 2.3.3. Composite Actions Composite Actions are a type of reusable Actions. A composite Action bundles a list of Actions or shell commands as a single Action, thus appearing as a single step in a workflow file. Assuming a composite Action example/build-publish-python@v1 is declared and bundles the last four steps of the workflow presented at the Listing 1, we could simplify our workflow by replacing these steps with a reference to example/build-publish-python@v1. An example of resulting workflow is presented at the Listing 3. Line 8 calls a composite Action, while lines 9-12 are giving needed variables to the Action (such as username and password). 2.3.4. Starter Workflows GitHub freely provides examples and templates of GitHub Actions workflows, called starter workflows. Such starter workflows can be found freely in a GitHub repository5 or can also be integrated automati- cally via the GitHub web interface. Maintainers can copy-paste the content into their own workflows and adapt them to their own use-case. Such reuse allows for a different starting point when creating a new workflow file, offering an alternative to a blank workflow file. Nevertheless, such reuse can also result in the creation of a significant amount of code clones across workflow files. 3 https://github.com/marketplace?type=actions 4 https://docs.github.com/en/actions/sharing-automations/reusing-workflows 5 https://github.com/actions/starter-workflows Listing 3: GitHub Action workflow file for building a Python package via a composite action 1 # previous keys (on, permissions, etc.) 2 3 jobs: 4 deploy: 5 runs-on: ubuntu-latest 6 steps: 7 - uses: actions/checkout@v4 8 - uses: example/build-publish-python@v1 9 with: 10 python-version: 3.12 11 username: __token__ 12 password: ${{ secrets.PYPI_API_TOKEN }} As an example, the previously presented workflow in Listing 1 is one of the available starter workflow, where only the python version was changed from 3.x to 3.12. 3. Research Goals Since configuration-as-code files essentially contain code, I hypothesise that they are affected by the majority of issues encountered in general-purpose code, even though the syntax for expressing configuration files can be quite different from the one of traditional programming languages. In the specific case of GitHub Actions, Saroar and Nayebi [44] qualitatively concluded that writing a workflow based on a previously written workflow file and using a starter workflow as a basis are the two most popular methods used by consulted GitHub Actions maintainers when writing a workflow file. The majority of consulted maintainers thus appear to be basing themselves on other workflows, which may be indicative of the popularity of copy-paste or clone-and-own reuse across workflow files. Such reuse might lead to a substantial number of code clones in GitHub Actions workflow files. I thus hypothesise that CI/CD configuration files also suffer from code cloning, due to copy-paste or clone-and-own reuse and the many observed reasons for code duplication in general purpose code (such as automatic code generation [17]). My doctoral research therefore aims to empirically study code clones in configuration as code files of CI/CD tools. As a first step (G1-3), I will study the GitHub Actions ecosystem, and in a final step (G4), I will extend this research to a wider range of tools, and also compare the practices and impact of code cloning across these tools. To achieve these different goals, I plan on using a mixed-method empirical analysis as advocated by Creswell [46]. Such mixed-method empirical analysis entails two primary research methodologies: qualitative and quantitative. Quantitative studies and mining of software repositories are typically sufficient to inform decision-making. However, a qualitative approach may be necessary to better understand the underlying causes behind specific phenomena, which might prove difficult to study via a quantitative approach. 3.1. G1: Assessing Code Cloning Practices in the GitHub Actions Ecosystem G1 will mainly focus on GitHub Actions workflow files. After a careful review of the literature, I found no previous research analysing code clones for GitHub Actions. In the goal of quantifying and describing code clones in GitHub Actions, we will proceed to address the following research questions: RQ1: What are code clones in workflows and how to identify them? This question will study how code clones can be detected in GitHub Actions workflow files. The order and structure of the syntactic keys used in workflow files (such as ‘on:’, ‘jobs:’, ‘uses:’ or ‘run:’) lead to quite some repetition and syntactic similarity across workflow files. We therefore expect current clone detection algorithms to yield many false positives and lose precision when applied on configuration code, which can be syntactically quite different from general-purpose code. We plan to devise an algorithm taking these syntactic specificities into account in order to precisely detect code clones in workflow files. We will exclude Type IV clones from the analysis as they do not depend on the syntax. RQ2: How prevalent are code clones in workflow files? This question aims to quantify code clones across workflow files. We plan to empirically analyse code clone presence in GitHub Actions workflow files, thus confirming the presence of code clones as well as the extent of duplicated code in such files. RQ3: What are the characteristics of code clones? Clone characteristics such as length or position will be studied as well as social characteristics such as the author and maintainer of code clones. This question will also analyse the code clone contents, e.g., whether they are complete jobs, sequence of steps or events. How code clones differ or resemble one another will also be studied in this question. Such analysis will provide insights on which parts of workflow configurations maintainers tend to clone and why. 3.2. G2: Understanding code cloning practices In order to gain a deeper comprehension of code clones in GitHub Actions workflow files, this second goal aims to further analyse code cloning practices of workflow files by studying clones provenance, evolution and impacts. More specifically, we will address the following three research questions: RQ4: Where do code clones come from? Assuming code cloning is a popular practice among GitHub Actions workflows maintainers, such code cloning could come from several sources. Maintainers could use their own workflows, their organisation workflows or GitHub starter workflows as a basis. Based on the results of G1, we will construct and study the code clone evolution history, also known as clone genealogy, inspired by existing research [47, 48]. By leveraging an existing differencing tool for GitHub Actions workflow files developed at our lab [49], we will quantitatively study the origin of code clones. Such a study will provide insights in how code clones are introduced in workflow files and knowledge in how workflow files are created. In addition, we will also qualitatively study how developers tend to reuse code. The process of choosing what code to reuse and the practice chosen by developers to do so will also be a part of this study. RQ5: How do code clones co-evolve? By further studying the clone genealogy constructed in RQ2, this question aims to understand the co-evolution of code clones. Code clones can follow multiple evolution patterns such as consistent evolution (i.e., code fragments stays similar to a given similarity metric), independent evolution (i.e., code fragments become different to a given similarity metric), delayed or late propagation (i.e., code fragments changes consistently at different times) [50]. Which evolution pattern do code clones follow in the specific case of GitHub Action is still unknown. RQ6: What is the impact of code clones? This question seeks to understand how code clones affect workflow files and their creation or usage. In general purpose code, code clones can be beneficial in certain instances, yet exert a detrimental effect in others [22, 9, 18, 23, 8]. However, it is still unknown whether code clones impact positively or negatively workflow files. This question will thus study the different impacts of code clones in workflows, basing its answer on the previously studied aspects of clones (studied in RQ4 and RQ5). 3.3. G3: Improving code reuse practices The objective of this goal is to improve code reuse practices in GitHub Actions workflow files by assisting maintainers in the avoidance of code clones considered detrimental. This goal is highly dependent on the conclusions of G1 and G2. It is our expectations that previous goals will enable us to improve code reuse practices in GitHub Actions workflows. However, this goal will be adapted or removed depending on the conclusions of G1 and G2. RQ7: Are reusable components introduced in workflows? This question seeks to understand whether code clones transform into reusable components in the context of GitHub Actions workflow files, and whether this change impacts their code clones. GitHub introduced multiple code reusability mechanisms (such as reusable workflows or composite Actions) that could limit code cloning. However, whether maintainers refactor their workflow files in order to use these mechanisms and thus, reduce code duplication, is still unknown. Similarly, the reasons to use (or avoid) such mechanisms and the difficulties encountered by maintainers in doing so are still unknown and will be qualitatively studied as part of this research question. RQ8: How to help maintainers avoid code clones? As it is already the case with general purpose code [9], we expect code clones in GitHub Actions workflow files to have positive and negative effects. (The veracity of these expectations will be es- tablished in RQ6). Though some workflows may not need nor benefit from code clones removal, we expect that refactoring often encountered code fragments will mitigate or remove adverse effects of code clones, thus helping maintainers. The objective of this research question is thus to identify and implement strategies in order to help maintainers avoid code clones, or at least mitigate the adverse effects of such clones. Such strategies could include refactoring code clones into reusable components (such as Actions or reusable workflows). Such refactoring could use code clones in order to create reusable components and provide an automatic patch introducing such components. This automatic refactoring may be a response to possible difficulties encountered by maintainers as will be studied in RQ4. Due to limitations of reusability mechanisms GitHub or specific needs in workflows, we expect to be unable to refactor some code clones. Another strategy thus involves automatically propagating changes of code clones in order to support their co-evolution studied in RQ5. Such a strategy will mitigate or remove bug propagation while also reducing maintenance costs, which are two of the adverse effects of code clones [9]. This research question will be divided into multiple research questions as my doctoral research progresses. As my doctoral research has only recently begun, I currently do not have enough hindsight related to this question to provide a detailed description. 3.4. G4: Generalisation to Other CI/CD Tools and Social Coding Platforms Other CI/CD tools (such as Jenkins6 , CircleCI7 , Travis8 or AppVeyor9 ) or social coding platforms (such as GitLab10 or Gitea11 ) uses configuration-as-code practice in the scope of CI/CD configuration. The second goal aims to expand the scope of the study to other CI/CD ecosystems, allowing us to compare how code cloning and reuse varies between different CI/CD ecosystems using configuration-as-code practice. The methods of this second goal is based on the same fundamental concepts of G1-3 methods. A large-scale and representative dataset of configuration files will be required for each considered CI/CD. 6 https://www.jenkins.io/ 7 https://circleci.com/ 8 https://www.travis-ci.com 9 https://www.appveyor.com/ 10 https://docs.gitlab.com/ee/ci/ 11 https://about.gitea.com/ Large-scale archiving system (such as Software Heritage [51]) could be a first step in retrieving enough data for different CI/CD. Moreover, in the particular case of GitLab, such archiving system could also preserve self-hosted GitLab instances, allowing us to consider such instances. Concerning the detection of code clones, I plan to adapt the algorithm developed in Goal 1 to other configurations files, beginning with those based on YAML for simplicity. The depth of this goal and its feasibility will be dependent on the time left in my thesis. The following research question will be addressed by this goal: RQ9: What are the similarities between code clones of different CI/CD ecosystems? This question aims to compare the different code clones between different CI/CD ecosystems providing knowledge about whether practices related to code clones differ per ecosystem or whether they are common. This question will also provide insights into how maintainers deal with code clones on different ecosystems. One part of this question will focus on the code reusability practices used in different CI/CD tools. Understanding and comparing the practices used in different platforms will allow us to understand why a given practice is used in a single CI/CD, or to the contrary, common to all. In doing so, we expect to identify how code reusability systems are used in different CI/CD tools, and to formulate various advices helping developers of different platforms. Moreover, the content and semantic behind detected clones will be analysed and compared. Such an analysis may result in recommendations, easing the task of creating and maintaining CI/CD configura- tion files. As a first step, this question will focus on GitLab CI/CD. Similarly to GitHub Actions, GitLab CI/CD is directly integrated into GitLab via a YAML configuration files located inside the repository. Whereas the main ideas are the same, they are structured differently in a GitLab configuration file. Gitea is another social coding platform integrating a CI/CD. It is of particular interest in this question as its configuration files follow the same syntax of GitHub Actions, though some differences may occur.12 Other CI/CD will be included in this research questions depending on the time left in my thesis. This last research question will later be divided into multiple research questions. As my doctoral research has only just begun, I currently do not have enough hindsight related to provide details related to this question. 4. Progress and Future Work This section explains my current progress for the different goals. 4.1. Dataset A preliminary goal of my thesis is to gather a large historical dataset of GitHub workflow files, allowing us to quantitatively study the code duplication and reuse in the context of workflow files. To this end, I developed and published gigawork (which is an acronym for “Give me GitHub Actions Workflows”), an open-source extraction tool for extracting workflow files from a git repository.13 gigawork traverses the git history tree beginning from a given git reference, git HEAD by default, following the first-parent rule.14 For each commit touching a workflow file, it extracts the workflow file version as well as other metadata (such as the commit date, the author, ...) [52]. gigawork was applied on a list of 43K+ repositories, obtained via a query to SEART search engine [53] (where we excluded repositories not using GitHub Actions). In an updated version of the dataset [52], 2.5M+ workflow files, representing 219K+ workflow histories, are present. In addition, boolean flags indicating which workflow files of the dataset are valid YAML files and which respect the current GitHub Actions syntax were added by gigawork. 12 https://docs.gitea.com/next/usage/actions/comparison 13 https://github.com/cardoeng/gigawork 14 This rule dictates that git only follows the first parent of each commit. Figure 1: A visual diff of two workflows found in GitHub repositories. Although the dataset does not contain any information about code clones, we can use it to detect such code clones. For instance, two very similar workflow files, extracted from the dataset, can be found at Figure 1.15 Both workflows are very similar to the one presented at Listing 1, though they are coming from different repositories. Moreover, they also have the same goal. As seen on Figure 1, the main changes between these two workflows are comments and version changes. We can also note the addition of a new python dependency (virtualenv). Finally, a limit of depth for fetching the repository was added via fetch-depth variable. 4.2. Code clones detection In the goal of answering RQ1, multiple ways of detecting code clones in GitHub Actions workflow files were already studied and experimented. However, no definitive algorithm was found yet. I am thus currently focussing on RQ1, continuing experimenting different approaches in the goal of efficiently detecting code clones, while keeping in mind this algorithm might be extended to other YAML files. References [1] GitHub, Github actions now supports ci/cd, free for public repositories, https://github.blog/ 2019-08-08-github-actions-now-supports-ci-cd/, 2019. [Online; accessed 21 June 2024]. [2] GitHub, Octoverse: The state of open source and rise of ai in 2023, https://github.blog/ 2023-11-08-the-state-of-open-source-and-ai/, 2023. [Online; accessed 13 May 2024]. [3] GitLab, Gitlab 8.0 released with new looks and integrated ci!, https://about.gitlab.com/releases/ 2015/09/22/gitlab-8-0-released/, 2015. [Online; accessed 21 June 2024]. [4] GitLab, The most-comprehensive ai-powered devsecops platform, https://about.gitlab.com/, 2024. [Online; accessed 22 June 2024]. [5] M. L. Gupta, R. Puppala, V. V. Vadapalli, H. Gundu, C. Karthikeyan, Continuous integration, 15 Left side corresponds to https://github.com/instrumentkit/InstrumentKit/blob/main/.github/workflows/deploy.yml, right side corresponds to https://github.com/ASUS-AICS/LibMultiLabel/blob/master/.github/workflows/python-publish.yml. Both represents the latest version available in the dataset. delivery and deployment: A systematic review of approaches, tools, challenges and practices, in: International Conference on Recent Trends in AI Enabled Technologies, Springer, 2024, pp. 76–89. [6] A. Decan, T. Mens, P. R. Mazrae, M. Golzadeh, On the use of GitHub Actions in software develop- ment repositories, in: International Conference on Software Maintenance and Evolution (ICSME), IEEE, 2022. [7] C. Parnin, E. Helms, C. Atlee, H. Boughton, M. Ghattas, A. Glover, J. Holman, J. Micco, B. Murphy, T. Savor, et al., The top 10 adages in continuous deployment, IEEE Software 34 (2017) 86–95. [8] M. Fowler, Refactoring: improving the design of existing code, Addison-Wesley Professional, 2018. [9] N. Saini, S. Singh, et al., Code clones: Detection and management, Procedia computer science 132 (2018) 718–727. [10] M. A. Oumaziz, J.-R. Falleri, X. Blanc, T. F. Bissyandé, J. Klein, Handling duplicates in dockerfiles families: Learning from experts, in: International Conference on Software Maintenance and Evolution (ICSME), IEEE, 2019, pp. 524–535. [11] M. A. Oumaziz, Cloning beyond source code: a study of the practices in API documentation and infrastructure as code., Ph.D. thesis, Bordeaux, 2020. [12] S. McIntosh, M. Poehlmann, E. Juergens, A. Mockus, B. Adams, A. E. Hassan, B. Haupt, C. Wagner, Collecting and leveraging a benchmark of build system clones to aid in quality assessments, in: International Conference on Software Engineering, 2014, pp. 145–154. [13] G. Cardoen, Une analyse empirique de la duplication de code dans les CI/CD workflows sur GitHub, Master’s thesis, University of Mons, 2023. [14] C. W. Krueger, Software reuse, ACM Computing Surveys (CSUR) 24 (1992) 131–183. [15] S. Haefliger, G. Von Krogh, S. Spaeth, Code reuse in open source software, Management science 54 (2008) 180–193. [16] J. Krüger, T. Berger, An empirical analysis of the costs of clone-and platform-oriented software reuse, in: ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering, 2020, pp. 432–444. [17] M. Zakeri-Nasrabadi, S. Parsa, M. Ramezani, C. Roy, M. Ekhtiarzadeh, A systematic literature review on source code similarity measurement and clone detection: Techniques, applications, and challenges, Journal of Systems and Software (2023) 111796. [18] C. K. Roy, J. R. Cordy, A survey on software clone detection research, Queen’s School of computing TR 541 (2007) 64–68. [19] Q. U. Ain, W. H. Butt, M. W. Anwar, F. Azam, B. Maqbool, A systematic review on code clone detection 7 (2019) 86121–86144. [20] M. Lei, H. Li, J. Li, N. Aundhkar, D.-K. Kim, Deep learning application on code clone detection: A review of current knowledge, Journal of Systems and Software 184 (2022) 111141. [21] A. Sheneamer, J. Kalita, A survey of software clone detection techniques, International Journal of Computer Applications 137 (2016) 1–21. [22] C. J. Kapser, M. W. Godfrey, “cloning considered harmful” considered harmful: patterns of cloning in software, Empirical Software Engineering 13 (2008) 645–692. [23] E. Juergens, F. Deissenboeck, B. Hummel, S. Wagner, Do code clones matter?, in: International Conference on Software Engineering, IEEE, 2009, pp. 485–495. [24] C. K. Roy, M. F. Zibran, R. Koschke, The vision of software clone management: Past, present, and future (keynote paper), in: Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE), IEEE, 2014, pp. 18–33. [25] C. K. Roy, J. R. Cordy, Nicad: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization, in: International Conference on Program Comprehension, IEEE, 2008, pp. 172–181. [26] T. Kamiya, S. Kusumoto, K. Inoue, CCFinder: A multilinguistic token-based code clone detection system for large scale source code, IEEE Transactions on Software Engineering 28 (2002) 654–670. [27] Y. Semura, N. Yoshida, E. Choi, K. Inoue, CCFinderSW: Clone detection tool with flexible multilin- gual tokenization, in: 2017 24th Asia-Pacific Software Engineering Conference (APSEC), IEEE, 2017, pp. 654–659. [28] H. Sajnani, V. Saini, J. Svajlenko, C. K. Roy, C. V. Lopes, SourcererCC: Scaling code clone detection to big-code, in: Proceedings of the 38th international conference on software engineering, 2016, pp. 1157–1168. [29] L. Jiang, G. Misherghi, Z. Su, S. Glondu, Deckard: Scalable and accurate tree-based detection of code clones, in: 29th International Conference on Software Engineering (ICSE’07), IEEE, 2007, pp. 96–105. [30] M. Wang, P. Wang, Y. Xu, Ccsharp: An efficient three-phase code clone detector using modified pdgs, in: 2017 24th Asia-Pacific Software Engineering Conference (APSEC), IEEE, 2017, pp. 100–109. [31] F. Arcelli Fontana, M. Zanoni, F. Zanoni, A duplicated code refactoring advisor, in: Agile Processes in Software Engineering and Extreme Programming, Springer, 2015, pp. 3–14. [32] V. Gayoso Martínez, F. Hernández Álvarez, L. Hernández Encinas, State of the art in similarity preserving hashing functions (2014). [33] L. Lulu, B. Belkhouche, S. Harous, Overview of fingerprinting methods for local text reuse detection, in: 2016 12th International Conference on Innovations in Information Technology (IIT), IEEE, 2016, pp. 1–6. [34] J. Martinez-Gil, Source code clone detection using unsupervised similarity measures, in: Interna- tional Conference on Software Quality, Springer, 2024, pp. 21–37. [35] J. R. Pate, R. Tairas, N. A. Kraft, Clone evolution: A systematic review, Journal of software: Evolution and Process 25 (2013) 261–283. [36] M. Mondal, B. Roy, C. K. Roy, K. A. Schneider, An empirical study on bug propagation through code cloning, Journal of Systems and Software 158 (2019) 110407. [37] P. Estefó, R. Robbes, J. Fabry, Code duplication in ROS launchfiles, in: International Conference of the Chilean Computer Science Society (SCCC), IEEE, 2015, pp. 1–6. [38] GitHub, Octoverse 2022: The state of open source software, https://octoverse.github.com/2022/ developer-community, 2022. [Online; accessed 13 May 2024]. [39] M. Golzadeh, A. Decan, T. Mens, On the rise and fall of CI services in GitHub, in: International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, 2022. [40] T. Kinsman, M. Wessel, M. A. Gerosa, C. Treude, How do software developers use GitHub Actions to automate their workflows?, in: International Conference on Mining Software Repositories (MSR), IEEE, 2021, pp. 420–431. [41] P. Rostami Mazrae, T. Mens, M. Golzadeh, A. Decan, On the usage, co-usage and migration of CI/CD tools: A qualitative analysis, Empirical Software Engineering 28 (2023) 52. [42] P. Valenzuela-Toledo, A. Bergel, Evolution of GitHub Action workflows, in: International Confer- ence on Software Analysis, Evolution and Reengineering (SANER), IEEE, 2022. [43] P. R. Mazrae, A. Decan, T. Mens, M. Wessel, A Preliminary Study of GitHub Actions Workflow Changes, in: CEUR Workshop Proceedings, 2023. [44] S. G. Saroar, M. Nayebi, Developers’ Perception of GitHub Actions: A Survey Analysis, in: International Conference on Evaluation and Assessment in Software Engineering (EASE), ACM, 2023. [45] A. Khatami, C. Willekens, A. Zaidman, Catching Smells in the Act: A GitHub Actions Workflow Investigation, in: International Working Conference on Source Code Analysis and Manipulation (SCAM). IEEE, 2024. [46] J. W. Creswell, Mixed-method research: Introduction and application, in: Handbook of educational policy, Elsevier, 1999, pp. 455–472. [47] M. Kim, V. Sazawal, D. Notkin, G. Murphy, An empirical study of code clone genealogies, in: European software engineering conference held jointly with 13th ACM SIGSOFT international symposium on Foundations of software engineering, 2005, pp. 187–196. [48] J. Krinke, A study of consistent and inconsistent changes to code clones, in: Working Conference on Reverse Engineering (WCRE), IEEE, 2007, pp. 170–178. [49] P. R. Mazrae, A. Decan, T. Mens, gawd: A differencing tool for GitHub Actions workflows, in: International Conference on Mining Software Repositories, ACM, 2024. [50] S. Thummalapenta, L. Cerulo, L. Aversano, M. Di Penta, An empirical study on the maintenance of source code clones, Empirical Software Engineering 15 (2010) 1–34. [51] R. Di Cosmo, S. Zacchiroli, Software heritage: Why and how to preserve software source code, in: International Conference on Digital Preservation, 2017, pp. 1–10. [52] G. Cardoen, T. Mens, A. Decan, A dataset of GitHub Actions workflow histories, in: International Conference on Mining Software Repositories, ACM, 2024. [53] O. Dabic, E. Aghajani, G. Bavota, Sampling projects in GitHub for MSR studies, in: International Conference on Mining Software Repositories (MSR), IEEE, 2021, pp. 560–564.