On the Effect of Discussions on Pull Request Decisions

Mehdi Golzadeh, Alexandre Decan, Tom Mens
Software Engineering Lab, University of Mons
Mons, Belgium
{mehdi.golzadeh, alexandre.decan, tom.mens}@umons.ac.be

Abstract

Open-source software relies on contributions from different types of contributors. Online collaborative development platforms, such as GitHub, usually provide explicit support for these contributions through the mechanism of pull requests, allowing project members and external contributors to discuss and evaluate the submitted code. These discussions can play an important role in the decision-making process leading to the acceptance or rejection of a pull request. In this paper, we empirically examine 183K pull requests and their discussions, for almost 4.8K GitHub repositories of the Cargo ecosystem. We investigate the prevalence of such discussions, their participants and their size in terms of messages and durations, and study how these aspects relate to pull request decisions.

Index terms: collaborative development, pull requests, discussions, software repository mining, empirical analysis¹

¹ This research is supported by the joint FNRS / FWO Excellence of Science project SECO-ASSIST and FNRS PDR T.0017.18.

Copyright © by the paper’s authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
In: D. Di Nucci, C. De Roover (eds.): Proceedings of the 18th Belgium-Netherlands Software Evolution Workshop, Brussels, Belgium, 28-11-2019, published at http://ceur-ws.org

1 Introduction

Today’s open source software development increasingly relies on third-party contributors. Developers contribute to different projects on online distributed development platforms like GitHub. The collaborative nature of software development makes it an inherently social phenomenon [1, 2]. GitHub embraces this social nature by extending the traditional git workflow with collaboration mechanisms such as pull requests (PR) and commenting. The pull-based development process [3] constitutes the primary means for integrating code from thousands of developers. It allows developers to participate in many projects without having direct commit access. The primary advantage of a PR is the decoupling of the development effort from the decision to merge the result to the project’s codebase. It helps developers to avoid frequent merge conflicts with other contributors.

Through a built-in commenting mechanism, project integrators can review the code submitted in a PR, and ask contributors to improve their code, add documentation and tests before deciding to integrate it [4, 5]. Therefore, the history of commenting activity on a PR (including all pull request comments and pull request review comments) provides a valuable source of information. It enables analysis of who was involved in the discussion about a PR (e.g. the PR creator, project integrators, or other contributors). The discussions that take place between the author of the PR and the project integrators may play a key role in the ultimate decision to merge the PR into the code base, if the concerns raised by the project integrators were properly addressed or discussed carefully by the PR author.

While many studies have focused on the importance of having successful PRs [6–9], there is much less research on understanding the effect of the presence of discussions on the decision to accept or reject a PR. Our research aims to empirically study the relation between the PR commenting history and the final PR decision. As preliminary steps, we focus in this paper on three research questions:

RQ1 “How prevalent are discussions in PRs?” helps us to determine whether the research goal is worthwhile to pursue: if there is only a limited number of PRs with discussions, then we will not be able to draw statistically significant conclusions on their relation with PR decisions. We show that most PRs have at least a few comments and a few participants involved in their discussions, and that the presence of a discussion is related to the decision. In RQ2 “Who is involved in PR discussions?” we identify and group participants based on their role in a PR. We report on their combined presence in discussions and exhibit a relation between a PR decision and the participants that are involved in its discussion. Finally, in RQ3 “How long are discussions?” we measure discussion length in terms of time and of number of comments and show how they relate to a PR decision.

The remainder of this paper is organized as follows. Section 2 provides the necessary background of studies related to PRs and comments. Section 3 presents the data extraction and methodology. Section 4 presents the preliminary results for the above research questions. Section 5 discusses the threats to validity of our study. Section 6 summarises the main findings and outlines future work.

2 Background

Distributed software development on shared online GitHub repositories very frequently follows a pull-based development process [3–5]. Any contributor can create forks of a repository, update them locally by contributing code changes and, whenever ready, request to have these changes merged back into the main branch by submitting a PR [10]. This pull-based software development model offers a distributed collaboration mechanism that allows developers to contribute code in a way that makes code changes trackable and reviewable by version control systems. This review mechanism has the additional effect of increasing awareness of all changes and allows the developer community to form an opinion about the proposed changes and the ultimate merge decision [11]. Many empirical studies have targeted pull requests from different points of view, including evaluation of PRs through discussion [6], factors influencing acceptance or rejection [8, 9, 12, 13] and predicting potential future contributors [14].

Moreover, there are studies that analyze the content of PRs to recommend core members to review, analyze, evaluate and integrate PRs [15–19], recommend PRs with high priority [20], study the effect of the geographical location of contributors on the evaluation of PRs [21], and study gender bias in PR acceptance or rejection [22]. Some studies targeted code reviews to study the reasons and impact of confusion in code reviews [23], linguistic aspects of code review comments [24], the impact of continuous integration on code reviews [25], the challenges faced by code change authors and reviewers [26], how developers perceive code review quality [27], and the effect of bots and of organization and developer profiles on the PR decision [7].

3 Methodology

To carry out our empirical investigation, we need a dataset containing a large number of repositories and PRs. The dataset should exclude git repositories that have been created merely for experimental or personal reasons, or that only show sporadic traces of activity and contributions [28]. Registries of reusable software packages (e.g., npm for JavaScript, Cargo for Rust, or PyPI for Python) are good candidates to find such repositories, as they typically host thousands of active software projects, and as one can expect most of them to have an associated git repository.

We selected the Cargo package registry for the Rust programming language, because it contains tens of thousands of projects, and a large majority of them (nearly 85%) is being developed on GitHub. As both Cargo and Rust are quite recent (Rust was introduced in 2011), they contain a large number of repositories, even after filtering out those that are inactive in terms of contributions and discussions related to these contributions.

We relied on the libraries.io data dump to extract the metadata for more than 15K Cargo packages [29]. We filtered out 1,571 packages that did not have any associated git repository and 413 packages whose repository is not hosted on GitHub. Not all git repositories were still available at the time we extracted the data, and our final list of repositories is composed of 9,954 candidates. For each of these repositories, we used the GitHub API to retrieve its complete list of PRs and, for each PR, all related comments and PR review comments. We found that 5,210 repositories did not have any PRs, hence only 4,744 repositories were retained for further analysis, accounting for more than 188K PRs.

As our goal is to study the relation between discussions and PR decisions, we decided to remove all PRs for which no decision was (yet) taken. Such PRs represent a small fraction of our dataset (around 2.6%). Our final dataset contains more than 183K PRs, submitted by 13,623 contributors and accounting for nearly 1M comments.

For each PR in this dataset, we have access to its creation date, its decision date, its decision, the person that made that decision, the author of the PR, and all the comments that were made, including PR review comments. It is important to note that the very first comment visible in a PR corresponds to the PR description, and is not considered as a PR comment in this paper, following the distinction also made by GitHub. For each comment, we retrieved its creation date and its owner. We distinguish between four categories of owners:

1. author corresponds to the contributor submitting the PR;

2. integrator refers to the person having accepted or rejected a previous PR in the same project;

3. decider refers to the integrator who accepted or rejected the PR currently under consideration; and

4. other corresponds to any other participant (e.g., users, bots, external contributors).

4 Research Results

RQ1 How prevalent are discussions in PRs?

With this first research question, we aim to get insights into the prevalence of discussions in PRs. For each PR in the dataset, we computed its number of comments, its number of distinct participants and its number of comment exchanges between one of the integrators and the author, i.e., the number of times there is one comment from an integrator followed by an answer from the PR author. Fig. 1 shows the proportion of PRs having at least a given number of comments, participants, and comment exchanges.

Figure 1: Proportion of PRs having at least a given number of comments, participants or comment exchanges. (Axes: minimum number of comments, participants or exchanges vs. proportion of PRs; one curve each for comments, participants and comment exchanges.)

We observe that while 48.8% of all PRs have at least two comments and 42.4% of all PRs have at least two participants, only 31.9% of them have comment exchanges. We also observe that all curves exhibit power law behaviour: the proportion of PRs decreases sharply as the required number of comments, participants or exchanges increases. For instance, around 80% of all PRs have fewer than 8 comments, 3 participants and 2 comment exchanges.

Since the presence of comments, participants and/or comment exchanges could affect the acceptance or rejection of a PR, we computed the proportion of accepted (resp. rejected) PRs that have at least one comment (has comments), at least two participants (has participants) and at least one comment exchange (has exchange). Fig. 2 reports on these proportions. Note that by definition a comment exchange implies at least 2 participants, hence we have has exchange ⇒ has participants ⇒ has comments.

Figure 2: Proportion of accepted and rejected PRs w.r.t. the presence of comments and participants. (Axes: PR decision, accepted or rejected, vs. proportion of PRs; one bar each for has comments, has participants and has exchange.)

While we observe that a majority of PRs (regardless of their decision) have comments, proportionally more rejected PRs (72.5%) than accepted ones (62.4%) have comments. Similar observations can be made for the other criteria, suggesting a relation between PR acceptance and the presence of a comment/participant.

RQ2 Who is involved in PR discussions?

This research question focuses on the participants that are involved in PR discussions. We distinguish between four categories of participants, as explained in Section 3. For each PR, each participant involved in the discussion was classified as author, integrator, decider or other. Fig. 3 shows the proportion of PR discussions as a function of the presence of categories of participants.

Figure 3: Proportion of PR discussions w.r.t. the presence of participants.

We observe that the author of a PR is involved in most discussions (64%=6+12+3+3+3+4+20+13), as is the case for deciders (62%=11+9+20+12+3+4+1+2) and integrators (57%=6+9+1+1+3+4+20+13). Other participants are involved in only 23% (=2+1+4+3+3+3+1+6) of the discussions. We observe that the most frequent combinations of participants involve the author and some integrator/decider. For instance, the pair author/integrator is the most frequent one (40%=13+20+4+3), followed by the pair author/decider (39%=20+12+4+3). 24% (=20+4) of the discussions involve the author, an integrator and the decider. 29% (=6+6+11+6) of all cases involve a single participant only.

Similar to what was done for RQ1, we grouped PRs according to their decision, and we computed the proportion of PRs with respect to the presence of participants of each category. Fig. 4 reports on these proportions.

Figure 4: Proportion of PRs w.r.t. participants, grouped by PR decision. (Axes: PR decision, accepted or rejected, vs. proportion of PRs; one bar each for discussions with author, decider, integrator and other.)

We observe some interesting differences between accepted and rejected PRs, mainly based on the presence of authors and integrators. 51.4% of rejected PRs involve the author of that PR and 49.6% involve an integrator, while for accepted PRs only 39.1% involve the author and 34.3% involve an integrator. While integrators are proportionally more involved in rejected than accepted PRs, the opposite is true when it comes to the decider of a PR: a decider is involved in 42.6% of accepted PRs but “only” in 22.0% of the rejected ones. Finally, when considering all other participants there is only a slight difference between accepted PRs (14.4%) and rejected PRs (17.4%).

RQ3 How long are discussions?

The last research question focuses on the length of discussions in terms of number of comments and time between the first and last comment. We computed these two characteristics for discussions having at least 2 comments. These account for 49% of all PRs considered so far. The results are reported in Fig. 5, combining a scatter plot and two density plots (one for each considered characteristic).

Figure 5: Scatter plot and density plots of discussion duration and number of comments. (Axes: number of comments vs. duration in days; accepted and rejected PRs shown separately.)

We observe from the density plots that most discussions have a few comments and last for a short period of time. For instance, the median number of comments is 5 and the median duration is 0.7 days. We observe from the scatter plot a difference between discussions in accepted and rejected PRs, both for the number of comments and the duration. We statistically compared these distributions by means of Mann-Whitney-U tests. The null hypothesis was rejected in both cases (p < 0.001), indicating a statistically significant difference between these distributions. However, we found this difference to be negligible (Cliff’s delta |d| = 0.025) for the number of comments [30, 31], and small (|d| = 0.219) for the duration of these discussions, indicating a higher duration in rejected PRs than in accepted ones. For instance, the median duration is 1.69 days for rejected PRs and 0.6 days for accepted ones.

The two regression lines superposed on the scatter plot reflect the average time between comments (i.e., the ratio between duration and comments). We computed this ratio for all considered discussions, and we statistically compared their distributions for accepted and rejected PRs using a Mann-Whitney-U test. We found a statistically significant difference between the two distributions (p < 0.001) and a small effect size (|d| = 0.258), indicating a lower average time between comments, i.e., more intense discussions, in accepted PRs than in rejected PRs. For instance, the median average time between comments is 0.08 days for accepted PRs, and 0.26 days for rejected PRs.

5 Threats to Validity

Since our analyses are based on data from git repositories on GitHub, our results may be exposed to the usual threats related to mining data from GitHub such as “a large portion of repositories are not for software development” and “two thirds of projects are personal” [28]. However, given that our dataset is composed of git repositories related to Cargo projects, it is unlikely to be affected by such threats. On the other hand, the selection bias induced by our dataset being exclusively based on repositories related to Cargo projects is a threat to external validity [32], since the results and conclusions cannot be generalized outside the scope of this study.

The main threat to construct validity is that “most pull requests appear as non-merged even if they are actually merged” [28], potentially leading to an overestimation of the number of rejected PRs to the detriment of accepted ones. Fully addressing this threat is not possible, but we could rely on heuristics to detect whether PR commits are actually part of the main branch. Such heuristics are likely to change the figures reported in this paper, but are unlikely to affect the findings we obtained. Indeed, even if some PRs were wrongly identified as non-merged (=rejected), we already exhibited differences in PR discussions between accepted and rejected PRs.

Another threat to construct validity stems from the presence of bots and contributors with multiple identities. We mitigated the problem of multiple identities by relying on GitHub usernames to identify contributors instead of the “author” field values. We did not consider the presence of bots in this work. This may have led to an overestimation of the number of comments and participants, but our findings should not be significantly affected, assuming that bots represent only a fraction of the considered comments. In our future work, we will study heuristics to detect bot comments in order to take them into account in our analyses.

Finally, the lack of distinction between the different types of comments in our dataset represents a threat to internal validity. Not all comments are equal, but they have been treated as such in this work. We did not differentiate based on the size or content of the comments. Similarly, we did not distinguish between PR comments and PR review comments, even if they do not serve the same purpose. Making such distinctions can potentially lead to different results, and will be explored in future work to gain additional insights.

6 Conclusion

In this preliminary research, we empirically studied 183K PRs and their discussions, accounting for around 1M comments. We showed that discussions are prevalent in PRs and there are proportionally more comments, participants and comment exchanges for rejected PRs than for accepted ones. We identified and grouped participants based on their role in a PR, and showed that a majority of discussions involved the author, the decider or one of the integrators. We showed that the presence of these participants is related to PR decisions.

Finally, we considered discussion length in terms of duration and number of comments. We observed that most discussions have only a few comments and do not last for long. While we have not found large differences between accepted and rejected PRs based on their number of comments, we found that discussions in rejected PRs are longer, and that discussions in accepted PRs are more intense.

This paper is part of a broader study and our intention is to gain a deeper understanding of the dynamics and patterns of discussions in pull requests, and their impact on PR decisions. Our goal is to provide techniques and tools to allow the community to perform better. Reducing the time to make decisions for pull requests can help the community to encourage better contributions by reducing the time required to reject contributions of insufficient quality or relevance, and by reducing the time to review and accept positive contributions. Moreover, based on the insights obtained during this study we aim to develop techniques to increase the productivity of contributions in terms of code quality and contribution time.

References

[1] Laura A. Dabbish, H. Colleen Stuart, Jason Tsay, and James D. Herbsleb. Social coding in GitHub: transparency and collaboration in an open software repository. In Int’l Conf. Computer Supported Cooperative Work, pages 1277–1286, 2012.

[2] Tom Mens, Marcelo Cataldo, and Daniela Damian. The social developer: The future of software development. IEEE Software, 36, January–February 2019.

[3] Georgios Gousios, Martin Pinzger, and Arie van Deursen. An exploratory study of the pull-based software development model. In International Conference on Software Engineering, pages 345–355. ACM, 2014.

[4] G. Gousios, A. Zaidman, M. Storey, and Arie van Deursen. Work practices and challenges in pull-based development: The integrator’s perspective. In International Conference on Software Engineering, volume 1, pages 358–368. IEEE, May 2015.

[5] Georgios Gousios, Margaret-Anne Storey, and Alberto Bacchelli. Work practices and challenges in pull-based development: The contributor’s perspective. In International Conference on Software Engineering, pages 285–296. ACM, 2016.

[6] Jason Tsay, Laura Dabbish, and James Herbsleb. Let’s talk about it: Evaluating contributions through discussion in github. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 144–154, New York, NY, USA, 2014. ACM.

[7] Olga Baysal, Oleksii Kononenko, Reid Holmes, and Michael W. Godfrey. Investigating technical and non-technical factors influencing modern code review. Empirical Software Engineering, 21(3):932–959, Jun 2016.

[8] Mohammad Masudur Rahman and Chanchal K. Roy. An insight into the pull requests of GitHub. In Working Conference on Mining Software Repositories, pages 364–367. ACM, 2014.

[9] Di Chen, Kathryn T. Stolee, and Tim Menzies. Replication can improve prior results: A github study of pull request acceptance. In Proceedings of the 27th International Conference on Program Comprehension, ICPC ’19, pages 179–190, Piscataway, NJ, USA, 2019. IEEE Press.

[10] Y. Yu, H. Wang, V. Filkov, P. Devanbu, and B. Vasilescu. Wait for it: Determinants of pull request evaluation latency on GitHub. In Working Conference on Mining Software Repositories, pages 367–371, May 2015.

[11] Jason Tsay, Laura Dabbish, and James Herbsleb. Influence of social and technical factors for evaluating contribution in GitHub. In International Conference on Software Engineering, pages 356–366. ACM, 2014.

[12] Igor Steinmacher, Gustavo Pinto, Igor Scaliante Wiese, and Marco A. Gerosa. Almost there: A study on quasi-contributors in open source software projects. In Proceedings of the 40th International Conference on Software Engineering, ICSE ’18, pages 256–266, New York, NY, USA, 2018. ACM.

[13] M. Wessel, I. Steinmacher, I. Wiese, and M. A. Gerosa. Should i stale or should i close? an analysis of a bot that closes abandoned issues and pull requests. In 2019 IEEE/ACM 1st International Workshop on Bots in Software Engineering (BotSE), pages 38–42, May 2019.

[14] Damien Legay, Alexandre Decan, and Tom Mens. On the impact of pull request decisions on future contributions. arXiv e-prints, page arXiv:1812.06269, Dec 2018.

[15] Y. Yu, H. Wang, G. Yin, and C. X. Ling. Reviewer recommender of pull-requests in GitHub. In International Conference on Software Maintenance and Evolution, pages 609–612. IEEE, Sep. 2014.

[16] Y. Yu, H. Wang, G. Yin, and C. X. Ling. Who should review this pull-request: Reviewer recommendation to expedite crowd collaboration. In Asia-Pacific Software Engineering Conference, volume 1, pages 335–342, Dec 2014.

[17] Manoel Limeira de Lima Júnior, Daricélio Moreira Soares, Alexandre Plastino, and Leonardo Murta. Developers assignment for analyzing pull requests. In ACM Symposium on Applied Computing, pages 1567–1572. ACM, 2015.

[18] Jing Jiang, J.-H He, and X.-Y Chen. Coredevrec: Automatic core member recommendation for contribution evaluation. Journal of Computer Science and Technology, 30:998–1016, 09 2015.

[19] Manoel Limeira de Lima Júnior, Daricélio Moreira Soares, Alexandre Plastino, and Leonardo Murta. Automatic assignment of integrators to pull requests: The importance of selecting appropriate attributes. Journal of Systems and Software, 144:181–196, 2018.

[20] E. v. d. Veen, G. Gousios, and A. Zaidman. Automatically prioritizing pull requests. In Working Conference on Mining Software Repositories, pages 357–361. IEEE, May 2015.

[21] Ayushi Rastogi, Nachiappan Nagappan, Georgios Gousios, and André van der Hoek. Relationship between geographical location and evaluation of developer contributions in github. In International Symposium on Empirical Software Engineering and Measurement. ACM, 2018.

[22] Josh Terrell, Andrew Kofink, Justin Middleton, Clarissa Rainear, Emerson Murphy-Hill, and Chris Parnin. Gender bias in open source: Pull request acceptance of women versus men. 01 2016.

[23] Felipe Ebert, Fernando Castor, Nicole Novielli, and Alexander Serebrenik. Confusion in code reviews: Reasons, impacts, and coping strategies. pages 49–60, 02 2019.

[24] Vasiliki Efstathiou and Diomidis Spinellis. Code review comments: Language matters. CoRR, abs/1803.02205, 2018.

[25] M. M. Rahman and C. K. Roy. Impact of continuous integration on code reviews. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), pages 499–502, May 2017.

[26] L. MacLeod, M. Greiler, M. Storey, C. Bird, and J. Czerwonka. Code reviewing in the trenches: Challenges and best practices. IEEE Software, 35(4):34–42, July 2018.

[27] O. Kononenko, O. Baysal, and M. W. Godfrey. Code review quality: How developers see it. In 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), pages 1028–1038, May 2016.

[28] Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M. German, and Daniela Damian. The promises and perils of mining GitHub. In Int’l Conf. Mining Software Repositories, pages 92–101. ACM, 2014.

[29] Jeremy Katz. Libraries.io open source repository and dependency metadata (version 1.4.0) [data set]. http://doi.org/10.5281/zenodo.2536573, 2018.

[30] N. Cliff. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin, 114(3):494–509, 1993.

[31] Jeanine Romano, Jeffrey D Kromrey, Jesse Coraggio, Jeff Skowronek, and Linda Devine. Exploring methods for evaluating group differences on the NSSE and other surveys: Are the t-test and Cohen’s d indices the most appropriate choices? In Annual Meeting of the Southern Association for Institutional Research, 2006.

[32] C. Wohlin, P. Runeson, M. Host, M. C. Ohlsson, B. Regnell, and A. Wesslen. Experimentation in Software Engineering - An Introduction. Kluwer, 2000.
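
As an illustration of the participant categories of Section 3 and of the per-PR counts used in RQ1, the following Python sketch shows one way these quantities could be derived from a list of comments. It is not the authors' implementation. The `Comment` type, the function names and the toy data are hypothetical, and two points the paper leaves open are resolved by assumption: when a person holds several roles, the precedence author > decider > integrator > other is applied, and a comment exchange is counted each time a comment by an integrator (or the decider) is later answered by the PR author, with each author answer closing at most one exchange.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class Comment:
    owner: str       # GitHub username of the commenter
    created_at: int  # any sortable timestamp (hypothetical representation)


def categorize(owner: str, author: str, decider: str, integrators: Set[str]) -> str:
    """Map a commenter to one of the four categories of Section 3.

    Assumed precedence when roles overlap: author > decider > integrator > other.
    """
    if owner == author:
        return "author"
    if owner == decider:
        return "decider"
    if owner in integrators:
        return "integrator"
    return "other"


def discussion_metrics(comments: Iterable[Comment], author: str, decider: str,
                       integrators: Set[str]) -> Dict[str, object]:
    """Compute the RQ1 counts and the three presence flags for a single PR."""
    ordered = sorted(comments, key=lambda c: c.created_at)
    exchanges = 0
    awaiting_answer = False  # True once an integrator/decider has commented
    for comment in ordered:
        role = categorize(comment.owner, author, decider, integrators)
        if role in ("integrator", "decider"):
            awaiting_answer = True
        elif role == "author" and awaiting_answer:
            exchanges += 1           # the author answers a pending comment
            awaiting_answer = False
    participants = {c.owner for c in ordered}
    return {
        "comments": len(ordered),
        "participants": len(participants),
        "exchanges": exchanges,
        "has_comments": len(ordered) >= 1,
        "has_participants": len(participants) >= 2,
        "has_exchange": exchanges >= 1,
    }


# Invented toy discussion: alice is the PR author, dave decided on the PR,
# bob decided on an earlier PR of the same project, carol is an outsider.
toy = [
    Comment("bob", 1), Comment("alice", 2), Comment("carol", 3),
    Comment("dave", 4), Comment("dave", 5), Comment("alice", 6),
]
metrics = discussion_metrics(toy, author="alice", decider="dave",
                             integrators={"bob", "dave"})
print(metrics)
```

With these assumptions, any PR with an exchange necessarily has at least two participants and at least one comment, which matches the implication chain stated in RQ1. A different tie-breaking rule for overlapping roles would change the category proportions, so the precedence should be treated as a parameter of the analysis.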
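
The effect sizes reported in RQ3 rely on Cliff's delta [30], interpreted with the thresholds commonly attributed to Romano et al. [31] (negligible below 0.147, small below 0.33, medium below 0.474, large otherwise). The sketch below, in plain Python, shows how the Mann-Whitney U statistic and Cliff's delta relate to each other. The function names and the sample values are invented for illustration and are not taken from the paper's dataset. The p-value is deliberately left out; in practice a library routine such as `scipy.stats.mannwhitneyu` would be a natural choice for the test itself.

```python
from typing import Sequence, Tuple


def dominance_counts(xs: Sequence[float], ys: Sequence[float]) -> Tuple[int, int, int]:
    """Count pairs (x, y) with x > y, x < y and x == y."""
    greater = less = ties = 0
    for x in xs:
        for y in ys:
            if x > y:
                greater += 1
            elif x < y:
                less += 1
            else:
                ties += 1
    return greater, less, ties


def mann_whitney_u(xs: Sequence[float], ys: Sequence[float]) -> float:
    """U statistic of xs: pairs won by xs, with ties counted as one half."""
    greater, _, ties = dominance_counts(xs, ys)
    return greater + 0.5 * ties


def cliffs_delta(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Cliff's delta: P(x > y) - P(x < y), a value in [-1, 1]."""
    greater, less, _ = dominance_counts(xs, ys)
    return (greater - less) / (len(xs) * len(ys))


def magnitude(delta: float) -> str:
    """Interpret |delta| with the commonly used thresholds (see lead-in)."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"


# Invented discussion durations in days, for illustration only.
accepted = [1, 2, 3]
rejected = [2, 3, 4]

u = mann_whitney_u(accepted, rejected)
delta = cliffs_delta(accepted, rejected)
print(u, delta, magnitude(delta))
print(magnitude(0.025), magnitude(0.219), magnitude(0.258))
```

The two statistics carry the same information: for samples of sizes m and n, delta = 2U/(mn) - 1. The pairwise loop is quadratic in the sample size, so for a dataset of the size studied here a rank-based computation of U would be preferable.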