=Paper= {{Paper |id=Vol-2605/16 |storemode=property |title=On the Effect of Discussions on Pull Request Decisions |pdfUrl=https://ceur-ws.org/Vol-2605/16.pdf |volume=Vol-2605 |authors=Mehdi Golzadeh,Alexandre Decan,Tom Mens |dblpUrl=https://dblp.org/rec/conf/benevol/GolzadehDM19 }} ==On the Effect of Discussions on Pull Request Decisions== https://ceur-ws.org/Vol-2605/16.pdf

On the Effect of Discussions on Pull Request Decisions

Mehdi Golzadeh, Alexandre Decan, Tom Mens
Software Engineering Lab, University of Mons
Mons, Belgium
{mehdi.golzadeh, alexandre.decan, tom.mens}@umons.ac.be

Abstract

Open-source software relies on contributions from different types of contributors. Online collaborative development platforms, such as GitHub, usually provide explicit support for these contributions through the mechanism of pull requests, allowing project members and external contributors to discuss and evaluate the submitted code. These discussions can play an important role in the decision-making process leading to the acceptance or rejection of a pull request. In this paper, we empirically examine 183K pull requests and their discussions, from almost 4.8K GitHub repositories in the Cargo ecosystem. We investigate the prevalence of such discussions, their participants and their size in terms of messages and durations, and study how these aspects relate to pull request decisions.

Index terms: collaborative development, pull requests, discussions, software repository mining, empirical analysis

This research is supported by the joint FNRS / FWO Excellence of Science project SECO-ASSIST and FNRS PDR T.0017.18.

Copyright © by the paper’s authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: D. Di Nucci, C. De Roover (eds.): Proceedings of the 18th Belgium-Netherlands Software Evolution Workshop, Brussels, Belgium, 28-11-2019, published at http://ceur-ws.org

1 Introduction

Today’s open source software development increasingly relies on third-party contributors. Developers contribute to different projects on online distributed development platforms like GitHub. The collaborative nature of software development makes it an inherently social phenomenon [1, 2]. GitHub embraces this social nature by extending the traditional git workflow with collaboration mechanisms such as pull requests (PR) and commenting. The pull-based development process [3] constitutes the primary means for integrating code from thousands of developers. It allows developers to participate in many projects without having direct commit access. The primary advantage of a PR is the decoupling of the development effort from the decision to merge the result into the project’s codebase. It helps developers to avoid frequent merge conflicts with other contributors.

Through a built-in commenting mechanism, project integrators can review the code submitted in a PR, and ask contributors to improve their code or add documentation and tests before deciding to integrate it [4, 5]. Therefore, the history of commenting activity on a PR (including all pull request comments and pull request review comments) provides a valuable source of information. It enables analysis of who was involved in the discussion about a PR (e.g. the PR creator, project integrators, or other contributors). The discussions that take place between the author of the PR and the project integrators may play a key role in the ultimate decision to merge the PR into the code base, if the concerns raised by the project integrators were properly addressed or discussed carefully by the PR author.

While many studies have focused on the importance of having successful PRs [6–9], there is much less research on understanding the effect of the presence of discussions on the decision to accept or reject a PR. Our research aims to empirically study the relation between the PR commenting history and the final PR decision. As preliminary steps, we focus in this paper on three research questions:

RQ1 How prevalent are discussions in PRs? helps us to determine whether the research goal is worthwhile to pursue: if there is only a limited number of PRs with discussions, then we will not be able to draw statistically significant conclusions on their relation with PR decisions. We show that most PRs
have at least a few comments and a few participants involved in their discussions, and that the presence of a discussion is related to the decision. In RQ2 Who is involved in PR discussions? we identify and group participants based on their role in a PR. We report on their combined presence in discussions and exhibit a relation between a PR decision and the participants that are involved in its discussion. Finally, in RQ3 How long are discussions? we measure discussion length in terms of time and of number of comments and show how they relate to a PR decision.

The remainder of this paper is organized as follows. Section 2 provides the necessary background of studies related to PRs and comments. Section 3 presents the data extraction and methodology. Section 4 presents the preliminary results for the above research questions. Section 5 discusses the threats to validity of our study. Section 6 summarises the main findings and outlines future work.

2 Background

Distributed software development on shared online GitHub repositories very frequently follows a pull-based development process [3–5]. Any contributor can create forks of a repository, update them locally by contributing code changes and, whenever ready, request to have these changes merged back into the main branch by submitting a PR [10]. This pull-based software development model offers a distributed collaboration mechanism that allows developers to contribute code in a way that makes code changes trackable and reviewable by version control systems. This review mechanism has the additional effect of increasing awareness of all changes and allows the developer community to form an opinion about the proposed changes and the ultimate merge decision [11]. Many empirical studies have targeted pull requests from different points of view, including evaluation of PRs through discussion [6], factors influencing acceptance or rejection [8, 9, 12, 13], and predicting potential future contributors [14].

Moreover, there are studies which analyze the content of PRs to recommend core members to review, analyze, evaluate and integrate PRs [15–19], recommend PRs with high priority [20], study the effect of geographical location of contributors on evaluation of PRs [21], and gender bias in PR acceptance or rejection [22]. Some studies targeted code reviews to study the reasons and impact of confusion in code reviews [23], linguistic aspects of code review comments [24], the impact of continuous integration on code reviews [25], the challenges faced by code change authors and reviewers [26], how developers perceive code review quality [27], and the effect of the presence of bots and of organization and developer profiles on the PR decision [7].

3 Methodology

To carry out our empirical investigation, we need a dataset containing a large number of repositories and PRs. The dataset should exclude git repositories that have been created merely for experimental or personal reasons, or that only show sporadic traces of activity and contributions [28]. Registries of reusable software packages (e.g., npm for JavaScript, Cargo for Rust, or PyPI for Python) are good candidates to find such repositories, as they typically host thousands of active software projects, and as one can expect most of them to have an associated git repository.

We selected the Cargo package registry for the Rust programming language, because it contains tens of thousands of projects, and a large majority of them (nearly 85%) is being developed on GitHub. As both Cargo and Rust are quite recent (Rust was introduced in 2011), they contain a large number of repositories, even after filtering out those that are inactive in terms of contributions and discussions related to these contributions.

We relied on the libraries.io data dump to extract the metadata for more than 15K Cargo packages [29]. We filtered out 1,571 packages that did not have any associated git repository and 413 packages whose repository is not hosted on GitHub. Not all git repositories were still available at the time we extracted the data, and our final list of repositories is composed of 9,954 candidates. For each of these repositories, we retrieved through the GitHub API its complete list of PRs and, for each PR, all related comments and PR review comments. We found that 5,210 repositories did not have any PRs, hence only 4,744 repositories were retained for further analysis, accounting for more than 188K PRs.

As our goal is to study the relation between discussions and PR decisions, we decided to remove all PRs for which no decision was (yet) taken. Such PRs represent a small fraction of our dataset (around 2.6%). Our final dataset contains more than 183K PRs, submitted by 13,623 contributors and accounting for nearly 1M comments.

For each PR in this dataset, we have access to its creation date, its decision date, its decision, the person that made that decision, the author of the PR, and all the comments that were made, including PR review comments. It is important to note that the very first comment visible in a PR corresponds to the PR description, and is not considered as a PR comment in this paper, following the distinction also made by GitHub. For each comment, we retrieved its creation date and its owner.
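The repository and PR counts reported in the Methodology section can be cross-checked with a few lines of arithmetic. The sketch below uses only figures quoted in that section; the variable names are illustrative, and because the PR totals are given as rounded values ("more than 188K", "around 2.6%"), the last result is an approximation rather than an exact count.

```python
# Figures quoted in the Methodology section; variable names are illustrative.
candidate_repos = 9_954      # GitHub repositories still available
repos_without_prs = 5_210    # repositories that had no PR at all
retained_repos = candidate_repos - repos_without_prs

total_prs = 188_000          # "more than 188K PRs" (rounded, assumed lower bound)
undecided_share = 0.026      # "around 2.6%" of PRs had no decision yet
decided_prs = round(total_prs * (1 - undecided_share))

print(retained_repos)  # 4744, matching the number of retained repositories
print(decided_prs)     # roughly 183K, consistent with the final dataset size
```

The first subtraction reproduces the 4,744 retained repositories exactly; the second only shows that removing about 2.6% of 188K PRs is consistent with the reported 183K.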
We distinguish between four categories of owners:

1. author corresponds to the contributor submitting the PR;
2. integrator refers to a person who has accepted or rejected a previous PR in the same project;
3. decider refers to the integrator who accepted or rejected the PR currently under consideration; and
4. other corresponds to any other participant (e.g., users, bots, external contributors).

4 Research Results

RQ1 How prevalent are discussions in PRs?

With this first research question, we aim to get insights into the prevalence of discussions in PRs. For each PR in the dataset, we computed its number of comments, its number of distinct participants and its number of comment exchanges between one of the integrators and the author, i.e., the number of times there is one comment from an integrator followed by an answer from the PR author. Fig. 1 shows the proportion of PRs having at least a given number of comments, participants, and comment exchanges.

[Figure 1 plot: proportion of PRs (y-axis, 0.0 to 1.0) against the minimum number of comments, participants or exchanges (x-axis, 0 to 24); one curve each for comments, participants and comment exchanges.]
Figure 1: Proportion of PRs having at least a given number of comments, participants or comment exchanges.

We observe that while 48.8% of all PRs have at least two comments and 42.4% of all PRs have at least two participants, only 31.9% of them have comment exchanges. We also observe that all curves are heavy-tailed, in line with power-law-like behaviour: the proportion of PRs drops sharply as the required number of comments, participants or exchanges increases. For instance, around 80% of all PRs have fewer than 8 comments, 3 participants and 2 comment exchanges.

Since the presence of comments, participants and/or comment exchanges could affect the acceptance or rejection of a PR, we computed the proportion of accepted (resp. rejected) PRs that have at least one comment (has comments), at least two participants (has participants) and at least one comment exchange (has exchange). Fig. 2 reports on these proportions. Note that by definition a comment exchange implies at least 2 participants, hence we have has exchange ⇒ has participants ⇒ has comments.

[Figure 2 plot: proportion of PRs (y-axis, 0.0 to 1.0) for accepted and rejected PRs; one bar each for has comments, has participants and has exchange.]
Figure 2: Proportion of accepted and rejected PRs w.r.t. the presence of comments and participants.

While we observe that a majority of PRs (regardless of their decision) have comments, proportionally more rejected PRs have comments (72.5%) than accepted ones (62.4%). Similar observations can be made for the other criteria, suggesting a relation between PR acceptance and the presence of a comment/participant.

RQ2 Who is involved in PR discussions?

This research question focuses on the participants that are involved in PR discussions. We distinguish between four categories of participants, as explained in Section 3. For each PR, each participant involved in the discussion was classified as author, integrator, decider or other. Fig. 3 shows the proportion of PR discussions as a function of the presence of categories of participants.

[Figure 3 plot: diagram of the proportion of PR discussions for each combination of participant categories.]
Figure 3: Proportion of PR discussions w.r.t. the presence of participants.

We observe that the author of a PR is involved in most discussions (64%=6+12+3+3+3+4+20+13), as is the case for deciders (62%=11+9+20+12+3+4+1+2) and integrators (57%=6+9+1+1+3+4+20+13). Other participants are involved in only 23% (=2+1+4+3+3+3+1+6) of the discussions. We observe that the most frequent combinations of participants involve the author and some integrator/decider. For instance, the pair composed of author/integrator is the most frequent one (40%=13+20+4+3), followed by the pair author/decider (39%=20+12+4+3). 24% (=20+4) of the discussions involve the author, an integrator and the decider. 29% (=6+6+11+6) of all cases involve a single participant only.
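The percentages quoted for Fig. 3 can be cross-checked by treating each exact combination of participant categories as one region of a four-set diagram. The sketch below is a reading aid, not the authors' analysis code: the `REGIONS` table is reconstructed from the sums quoted in the text (the assignment of each value to a region is an inference, since the figure itself is not reproduced here), and the helper name `share` is hypothetical.

```python
# Share (in %) of PR discussions per exact combination of participant
# categories. The values are inferred from the sums quoted in the text.
A, I, D, O = "author", "integrator", "decider", "other"

REGIONS = {
    frozenset({A}): 6, frozenset({I}): 6, frozenset({D}): 11, frozenset({O}): 6,
    frozenset({A, I}): 13, frozenset({A, D}): 12, frozenset({A, O}): 3,
    frozenset({I, D}): 9, frozenset({I, O}): 1, frozenset({D, O}): 2,
    frozenset({A, I, D}): 20, frozenset({A, I, O}): 3, frozenset({A, D, O}): 3,
    frozenset({I, D, O}): 1, frozenset({A, I, D, O}): 4,
}


def share(*categories):
    """Total share of discussions involving at least the given categories."""
    wanted = set(categories)
    return sum(pct for combo, pct in REGIONS.items() if wanted <= combo)


# Discussions in which exactly one category of participant takes part.
single = sum(pct for combo, pct in REGIONS.items() if len(combo) == 1)

print(share(A), share(D), share(I), share(O))    # 64 62 57 23
print(share(A, I), share(A, D), share(A, I, D))  # 40 39 24
print(single, sum(REGIONS.values()))             # 29 100
```

Under this reading the fifteen regions add up to 100%, which suggests that the quoted sums are mutually consistent.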
Similar to what was done for RQ1, we grouped PRs according to their decision, and we computed the proportion of PRs with respect to the presence of participants of each category. Fig. 4 reports on these proportions.

[Figure 4 plot: proportion of PRs (y-axis, 0.0 to 1.0) for accepted and rejected PRs; one bar each for discussions with author, decider, integrator and other.]
Figure 4: Proportion of PRs w.r.t. participants, grouped by PR decision.

We observe some interesting differences between accepted and rejected PRs, mainly based on the presence of authors and integrators. 51.4% of rejected PRs involve the author of that PR and 49.6% involve an integrator, while for accepted PRs only 39.1% involve the author and 34.3% involve an integrator. While integrators are proportionally more involved in rejected than accepted PRs, the opposite is true when it comes to the decider of a PR: a decider is involved in 42.6% of accepted PRs but “only” in 22.0% of the rejected ones. Finally, when considering all other participants there is only a slight difference between accepted PRs (14.4%) and rejected PRs (17.4%).

RQ3 How long are discussions?

The last research question focuses on the length of discussions in terms of number of comments and time between the first and last comment. We computed these two characteristics for discussions having at least 2 comments. These account for 49% of all PRs considered so far. The results are reported in Fig. 5, combining a scatter plot and two density plots (one for each considered characteristic).

[Figure 5 plot: duration in days (y-axis, 0 to 500) against number of comments (x-axis, 0 to 140), for accepted and rejected PRs.]
Figure 5: Scatter plot and density plots of discussion duration and number of comments.

We observe from the density plots that most discussions have a few comments and last for a short period of time. For instance, the median number of comments is 5 and the median duration is 0.7 days. We observe from the scatter plot a difference between discussions in accepted and rejected PRs, both for the number of comments and the duration. We statistically compared these distributions by means of Mann-Whitney-U tests. The null hypothesis was rejected in both cases (p < 0.001), indicating a statistically significant difference between these distributions. However, we found this difference to be negligible (Cliff’s delta |d| = 0.025) for the number of comments [30, 31], and small (|d| = 0.219) for the duration of these discussions, indicating a higher duration in rejected PRs than in accepted ones. For instance, the median duration is 1.69 days for rejected PRs and 0.6 days for accepted ones.

The two regression lines superposed on the scatter plot reflect the average time between comments (i.e., the ratio between duration and number of comments). We computed this ratio for all considered discussions, and we statistically compared their distributions for accepted and rejected PRs using a Mann-Whitney-U test. We found a statistically significant difference between the two distributions (p < 0.001) and a small effect size (|d| = 0.258), indicating a lower ratio, i.e., less time between comments, in accepted PRs than in rejected PRs. For instance, the median average time between comments is 0.08 days for accepted PRs, and 0.26 days for rejected PRs.

5 Threats to Validity

Since our analyses are based on data from git repositories on GitHub, our results may be exposed to the usual threats related to mining data from GitHub such as “a large portion of repositories are not for software development” and “two thirds of projects are personal” [28]. However, given that our dataset is composed of git repositories related to Cargo projects, it is unlikely to be affected by such threats. On the other hand, the selection bias induced by our dataset being exclusively based on repositories related to Cargo projects is a threat to external validity [32], since the results and conclusions cannot be generalized outside the scope of this study.

The main threat to construct validity is that “most pull requests appear as non-merged even if they are actually merged” [28], potentially leading to an over-
estimation of the number of rejected PRs to the detriment of accepted ones. Fully addressing this threat is not possible, but we could rely on heuristics to detect whether PR commits are actually part of the main branch. Such heuristics are likely to change the figures reported in this paper, but are unlikely to affect the findings we obtained. Indeed, even if some PRs were wrongly identified as non-merged (=rejected), we already exhibited differences in PR discussions between accepted and rejected PRs.

Another threat to construct validity stems from the presence of bots and contributors with multiple identities. We mitigated the problem of multiple identities by relying on GitHub usernames to identify contributors instead of the “author” field values. We did not consider the presence of bots in this work. This may have led to an overestimation of the number of comments and participants, but our findings should not be significantly affected, assuming that bots represent only a small fraction of the considered comments. In our future work, we will study heuristics to detect bot comments in order to take them into account in our analyses.

Finally, the lack of distinction between the different types of comments in our dataset represents a threat to internal validity. Not all comments are equal, but they have been treated as such in this work. We did not differentiate based on the size or content of the comments. Similarly, we did not distinguish between PR comments and PR review comments, even if they do not serve the same purpose. Making such distinctions can potentially lead to different results, and will be explored in future work to gain additional insights.

6 Conclusion

In this preliminary research, we empirically studied 183K PRs and their discussions, accounting for around 1M comments. We showed that discussions are prevalent in PRs and there are proportionally more comments, participants and comment exchanges for rejected PRs than for accepted ones. We identified and grouped participants based on their role in a PR, and showed that a majority of discussions involved the author, the decider or one of the integrators. We showed that the presence of these participants is related to PR decisions.

Finally, we considered discussion length in terms of duration and number of comments. We observed that most discussions have only a few comments and do not last for long. While we have not found large differences between accepted and rejected PRs based on their number of comments, we found that discussions in rejected PRs are longer, and that discussions in accepted PRs are more intense.

This paper is part of a broader study and our intention is to gain a deeper understanding of the dynamics and patterns of discussions in pull requests, and their impact on PR decisions. Our goal is to provide techniques and tools to allow the community to perform better. Reducing the time to make decisions for pull requests can help the community to encourage better contributions by reducing the time required to reject contributions of insufficient quality or relevance, and by reducing the time to review and accept positive contributions. Moreover, based on the insights obtained during this study we aim to develop techniques to increase the productivity of contributions in terms of code quality and contribution time.

References

[1] Laura A. Dabbish, H. Colleen Stuart, Jason Tsay, and James D. Herbsleb. Social coding in GitHub: transparency and collaboration in an open software repository. In Int’l Conf. Computer Supported Cooperative Work, pages 1277–1286, 2012.

[2] Tom Mens, Marcelo Cataldo, and Daniela Damian. The social developer: The future of software development. IEEE Software, 36, January–February 2019.

[3] Georgios Gousios, Martin Pinzger, and Arie van Deursen. An exploratory study of the pull-based software development model. In International Conference on Software Engineering, pages 345–355. ACM, 2014.

[4] G. Gousios, A. Zaidman, M. Storey, and Arie van Deursen. Work practices and challenges in pull-based development: The integrator’s perspective. In International Conference on Software Engineering, volume 1, pages 358–368. IEEE, May 2015.

[5] Georgios Gousios, Margaret-Anne Storey, and Alberto Bacchelli. Work practices and challenges in pull-based development: The contributor’s perspective. In International Conference on Software Engineering, pages 285–296. ACM, 2016.

[6] Jason Tsay, Laura Dabbish, and James Herbsleb. Let’s talk about it: Evaluating contributions through discussion in GitHub. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 144–154, New York, NY, USA, 2014. ACM.

[7] Olga Baysal, Oleksii Kononenko, Reid Holmes, and Michael W. Godfrey. Investigating techni-
cal and non-technical factors influencing modern code review. Empirical Software Engineering, 21(3):932–959, Jun 2016.

[8] Mohammad Masudur Rahman and Chanchal K. Roy. An insight into the pull requests of GitHub. In Working Conference on Mining Software Repositories, pages 364–367. ACM, 2014.

[9] Di Chen, Kathryn T. Stolee, and Tim Menzies. Replication can improve prior results: A GitHub study of pull request acceptance. In Proceedings of the 27th International Conference on Program Comprehension, ICPC ’19, pages 179–190, Piscataway, NJ, USA, 2019. IEEE Press.

[10] Y. Yu, H. Wang, V. Filkov, P. Devanbu, and B. Vasilescu. Wait for it: Determinants of pull request evaluation latency on GitHub. In Working Conference on Mining Software Repositories, pages 367–371, May 2015.

[11] Jason Tsay, Laura Dabbish, and James Herbsleb. Influence of social and technical factors for evaluating contribution in GitHub. In International Conference on Software Engineering, pages 356–366. ACM, 2014.

[12] Igor Steinmacher, Gustavo Pinto, Igor Scaliante Wiese, and Marco A. Gerosa. Almost there: A study on quasi-contributors in open source software projects. In Proceedings of the 40th International Conference on Software Engineering, ICSE ’18, pages 256–266, New York, NY, USA, 2018. ACM.

[13] M. Wessel, I. Steinmacher, I. Wiese, and M. A. Gerosa. Should I stale or should I close? An analysis of a bot that closes abandoned issues and pull requests. In 2019 IEEE/ACM 1st International Workshop on Bots in Software Engineering (BotSE), pages 38–42, May 2019.

[14] Damien Legay, Alexandre Decan, and Tom Mens. On the impact of pull request decisions on future contributions. arXiv e-prints, page arXiv:1812.06269, Dec 2018.

[15] Y. Yu, H. Wang, G. Yin, and C. X. Ling. Reviewer recommender of pull-requests in GitHub. In International Conference on Software Maintenance and Evolution, pages 609–612. IEEE, Sep. 2014.

[16] Y. Yu, H. Wang, G. Yin, and C. X. Ling. Who should review this pull-request: Reviewer recommendation to expedite crowd collaboration. In Asia-Pacific Software Engineering Conference, volume 1, pages 335–342, Dec 2014.

[17] Manoel Limeira de Lima Júnior, Daricélio Moreira Soares, Alexandre Plastino, and Leonardo Murta. Developers assignment for analyzing pull requests. In ACM Symposium on Applied Computing, pages 1567–1572. ACM, 2015.

[18] Jing Jiang, J.-H He, and X.-Y Chen. CoreDevRec: Automatic core member recommendation for contribution evaluation. Journal of Computer Science and Technology, 30:998–1016, 09 2015.

[19] Manoel Limeira de Lima Júnior, Daricélio Moreira Soares, Alexandre Plastino, and Leonardo Murta. Automatic assignment of integrators to pull requests: The importance of selecting appropriate attributes. Journal of Systems and Software, 144:181–196, 2018.

[20] E. v. d. Veen, G. Gousios, and A. Zaidman. Automatically prioritizing pull requests. In Working Conference on Mining Software Repositories, pages 357–361. IEEE, May 2015.

[21] Ayushi Rastogi, Nachiappan Nagappan, Georgios Gousios, and André van der Hoek. Relationship between geographical location and evaluation of developer contributions in GitHub. In International Symposium on Empirical Software Engineering and Measurement. ACM, 2018.

[22] Josh Terrell, Andrew Kofink, Justin Middleton, Clarissa Rainear, Emerson Murphy-Hill, and Chris Parnin. Gender bias in open source: Pull request acceptance of women versus men. 01 2016.

[23] Felipe Ebert, Fernando Castor, Nicole Novielli, and Alexander Serebrenik. Confusion in code reviews: Reasons, impacts, and coping strategies. pages 49–60, 02 2019.

[24] Vasiliki Efstathiou and Diomidis Spinellis. Code review comments: Language matters. CoRR, abs/1803.02205, 2018.

[25] M. M. Rahman and C. K. Roy. Impact of continuous integration on code reviews. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), pages 499–502, May 2017.

[26] L. MacLeod, M. Greiler, M. Storey, C. Bird, and J. Czerwonka. Code reviewing in the trenches: Challenges and best practices. IEEE Software, 35(4):34–42, July 2018.
[27] O. Kononenko, O. Baysal, and M. W. Godfrey. Code review quality: How developers see it. In 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), pages 1028–1038, May 2016.

[28] Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M. German, and Daniela Damian. The promises and perils of mining GitHub. In Int’l Conf. Mining Software Repositories, pages 92–101. ACM, 2014.

[29] Jeremy Katz. Libraries.io open source repository and dependency metadata (version 1.4.0) [data set]. http://doi.org/10.5281/zenodo.2536573, 2018.

[30] N. Cliff. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin, 114(3):494–509, 1993.

[31] Jeanine Romano, Jeffrey D Kromrey, Jesse Coraggio, Jeff Skowronek, and Linda Devine. Exploring methods for evaluating group differences on the NSSE and other surveys: Are the t-test and Cohen’s d indices the most appropriate choices? In Annual Meeting of the Southern Association for Institutional Research, 2006.

[32] C. Wohlin, P. Runeson, M. Host, M. C. Ohlsson, B. Regnell, and A. Wesslen. Experimentation in Software Engineering - An Introduction. Kluwer, 2000.