=Paper=
{{Paper
|id=Vol-2047/BENEVOL_2017_paper_10
|storemode=property
|title=Untangling Source Code Changes Using Program Slicing
|pdfUrl=https://ceur-ws.org/Vol-2047/BENEVOL_2017_paper_10.pdf
|volume=Vol-2047
|authors=Ward Muylaert,Coen De Roover
|dblpUrl=https://dblp.org/rec/conf/benevol/MuylaertR17
}}
==Untangling Source Code Changes Using Program Slicing==
Untangling Source Code Changes
Using Program Slicing
Ward Muylaert Coen De Roover
ward.muylaert@vub.be coen.de.roover@vub.be
Software Languages Lab Software Languages Lab
Vrije Universiteit Brussel Vrije Universiteit Brussel
Brussels, Belgium Brussels, Belgium
Abstract—Version control systems (VCS) are widely used to many different changes in a tangled commit. A code reviewer
manage the history of code bases. These histories in turn provide will have a harder time understanding larger commits of
opportunities for research. Researchers expect the commits in unrelated changes [13]. This in turn will lead to lower quality
these version control systems to be atomic. That is, each commit
performs one task. This is however not always the case. To feedback [1]. A researcher interested in historical analysis,
remedy this, we propose a commit untangling technique using finally, will need to decide on the “one” function of a commit
program slicing. In particular, we posit that all related changes even though many unrelated changes may be present in the
are part of the same program slice. To do so, we perform program commit.
slicing on changes. Preliminary results using intra-procedural In light of these difficulties, we propose an automated
slicing have proven to be encouraging. We are currently working
on expanding our work to be inter-procedural. commit untangling technique. Our technique employs program
slicing around changes. Program slicing is a technique to
I. I NTRODUCTION answer questions about the influence of program statements
Version control systems (VCS) are widely used to manage on other program statements [15], [11]. We extend this idea.
the history of code bases. Prominent examples include Git, We posit that all related changes are part of the same program
SVN, or Mercurial. A developer may “save” their changes slice. Thus, a commit may be untangled by means of the
into units called commits. Best practice suggests each commit created slices. A preliminary implementation of our technique
should only contain changes related to one task. Such commits performs intra-procedural slicing. Despite this limitation, early
are called atomic commits [2], [14]. In this manner, the version results are encouraging. We are currently in the process of
control system can be used to keep track of how the program expanding the implementation to work on an inter-procedural
under development evolves. The version control system can level.
also be applied to, for example, revert individual changes
or port changes to other versions of the code base. On the II. A RCHITECTURE
research side, version control systems provide a trove of Our technique consists of four major parts. First, the commit
software evolution information open to analysis. is distilled into fine-grained changes to the program’s abstract
However, developers do not necessarily follow the best syntax tree (AST). Second, a program dependence graph of the
practice of creating only atomic commits [9]. For example, program is created. Third, a slice of the program dependence
a small bug may be quickly fixed while working on another graph is produced for every fine-grained change. Finally,
feature and placed in the same commit. Floss refactoring is changes are grouped by means of the slices and the commit is
another problem: refactoring in order to implement a new partitioned accordingly. We have implemented our technique
feature. These situations make for larger commits in which to work on Java programs. The rest of this section provides
many unrelated changes are tangled together. Such commits further detail into each of the four steps.
are called tangled commits. For the first step, we make use of ChangeNodes [12].
Tangled commits occur on a regular basis. A study by ChangeNodes works on the AST of a Java program. Given
Herzig et al. found that up to 15% of Java bug fixes contain two versions of a program, ChangeNodes provides a list of
tangled changes [3]. Tao et al. found that between 17% and Insert, Update, Move, and Delete operations. In our case the
29% of investigated revisions were tangled [14]. Nguyen et two versions are the version before and the version after the
al. found that 11% to 39% of all the fixing commits used for commit under analysis. Applying the obtained list of changes
mining archives were tangled [10]. to the AST of the earlier version results in the AST of the
Tangled commits lead to problems on many fronts. We pro- later version. ChangeNodes thus provides fine-grained changes
vide four examples. A developer will have problems reverting describing the commit.
particular changes if they are a part of a bigger commit. For the second step, we make use of TinyPDG [6], [7],
A developer may also have problems integrating particular [5]. TinyPDG creates a program dependence graph (PDG) of
changes from another developer if the desired change is part of a Java program. TinyPDG does this intra-procedurally. Our
1
36
main motivation for choosing TinyPDG is that it makes use of commits affecting just one method could be analysed. This
Eclipse libraries to create the underlying AST. ChangeNodes limits the number of commits that can be analysed, the large
also employs the Eclipse libraries for this purpose. This makes majority of commits cover more than one method. In the case
it more straightforward to link these two parts together. of the Jaxen project, no commits are left to be analysed. This
In the third step, our technique performs forward and limitation will be solved by our (current and) future work in
backward slicing on the program dependence graph. This is which we are expanding the implementation to work inter-
done for every distilled fine-grained change obtained in step procedurally. The setup of our evaluation will remain the same
one. Thus for every change ci a slice S(ci ) is obtained. for the inter-procedural evaluation.
We implemented the intra-procedural slicing algorithm as We apply our technique to every atomic and tangled commit
described by Horwitz et al. [8] on top of the PDG created by affecting just one method. We consider three separate out-
TinyPDG. The algorithm by Horwitz et al. performs backward comes for the analysis of a commit, regardless of it being
slicing. We adjusted the algorithm so that it also performs atomic or tangled.
forward slicing. When we refer to slicing, we consider the 1) The commit is correctly identified as atomic or tangled.
combination of forward and backward slicing. Our technique and the dataset agree on what kind of
Finally, our technique considers the following relation R commit it is.
between changes. Changes ci and cj are related (notation: 2) The commit is not correctly identified. Our technique
ci Rcj ) if and only if ci 2 S(cj ) _ cj 2 S(ci ). By definition identified the commit as atomic/tangled while the dataset
of how slicing works, this relation is reflexive. The relation calls the commit tangled/atomic respectively.
is also clearly symmetric due to its symmetric definition. The 3) No result. This happened when the memory overhead
relation is however not necessarily transitive. We cannot state made the analysis crash. We have not yet played around
that if ci Rcj and cj Rck , then ci Rck . Consider for this the with this overhead in order to solve or analyse it.
following simplified situation. cj is part of the root node of a
In the case of atomic commits, the results are promising.
PDG with two children. ci is part of one of the child nodes. ck
For each of the projects except JRuby, the ratio of correctly
is part of the other child node. Slicing in this situation results in
analysed atomic commits is larger than 90%. For JRuby nearly
S(ci ) = {ci , cj }, S(cj ) = {ci , cj , ck }, and S(ck ) = {cj , ck }.
20% of commits could not be analysed. Incorrectly identified
Then ci Rcj and cj Rck , but ¬(ci Rck ). The relation R is thus
commits happened largely through statements like throw or
not an equivalence relation. Instead, we partition the set of
catch not being supported by TinyPDG.
changes into subsets by means of the following steps.
Our technique falters on certain types of tangled com-
1) If a change is not in relation with any change in any
mits. Except for the GWT project, more tangled commits
of the existing subsets, create a new subset with that
were incorrectly identified as atomic than they were correctly
change in it.
identified as tangled. However, the large majority of these
2) If a change is in a relation with (an) element(s) of exactly
incorrect identifications are due to formatting changes or
one existing subset, place the change in that subset.
changes to comments. Our technique does not take formatting
3) If a change is in a relation with two (or more) elements
nor comments into account, only code. As such, a commit
of different subsets, join the subsets together and add
handling both one code related task and one formatting task
the change to it.
would be classified as tangled in the dataset, but identified as
The last step fakes transitivity and effectively “widens” the atomic by our technique.
relation R: more changes are considered related than they
would be by our original definition of R. In terms of these IV. F UTURE W ORK
subsets, we rephrase our hypothesis as: A commit is atomic
if and only if our technique does not split up the commit into We are currently in the process of making the implementa-
different subsets of changes. tion of our technique inter-procedural. To do so, we need to
make two major changes.
III. E VALUATION First, our program dependence graph needs to consider the
To evaluate our hypothesis, we make use of a dataset of entire program. Rather than a procedure dependence graph, we
five Java programs as used by Herzig et al. in [3], [4]. The require a system dependence graph (SDG) or class dependence
programs in question are: ArgoUML, GWT, Jaxen, JRuby, graph (ClDG). To achieve this, we need to extend TinyPDG to
and Xstream. These projects were chosen for meeting certain work inter-procedurally. The main hurdle here is creating the
quality criteria. For each of the projects, Herzig et al. manually summary edges via the algorithm as described in [8]. These
identified atomic commits using commit and issue information. are necessary to avoid the calling context problem as described
Using the atomic commits, Herzig et al. also created artificial there.
tangled commits for each of the projects. Second, our slicer needs to be adjusted to work on the inter-
We used this dataset to perform a preliminary evaluation of procedural system dependence graph. These adjustments are
our hypothesis and technique. This preliminary evaluation has also described by [8].
one obvious limitation. Due to the current implementation of Once our implementation works inter-procedurally, we will
our technique being intra-procedural, only atomic and tangled perform the evaluation described in section III again.
2
37
V. C ONCLUSION
We wanted to untangle commits performing more than
one task. For this, we considered the hypothesis that related
changes belong to one and the same program slice. We created
a first implementation to test this hypothesis. This implemen-
tation works intra-procedurally. Despite this limitation, initial
results are promising. Incorrect identifications happen largely
either due to parts of the program our implementation does
not handle (e.g., throw statements) or due to formatting and
comment changes which our technique does not consider. We
are currently working on expanding the implementation of our
technique to work inter-procedurally. Once this is done, we
will redo the evaluation with the complete dataset.
ACKNOWLEDGMENT
Ward Muylaert is an SB PhD fellow at FWO, project
number 1S64317N.
R EFERENCES
[1] A. Bacchelli and C. Bird, “Expectations, outcomes, and challenges
of modern code review,” in International Conference on Software
Engineering (ICSE), 2013.
[2] M. Dias, A. Bacchelli, G. Gousios, D. Cassou, and S. Ducasse, “Un-
tangling fine-grained code changes,” in International Conference on
Software Analysis, Evolution, and Reengineering (SANER), 2015.
[3] K. Herzig, S. Just, and A. Zeller, “The impact of tangled code changes
on defect prediction models,” Empirical Software Engineering, vol. 21,
no. 2, pp. 303–336, Apr. 2015.
[4] K. Herzig and A. Zeller, “Untangling changes,” 2011.
[5] Y. Higo. Tinypdg: A library for building intraprocedural pdgs for java
programs. [Online]. Available: https://github.com/YoshikiHigo/TinyPDG
[6] Y. Higo and S. Kusumoto, “Enhancing quality of code clone detection
with program dependency graph,” in Working Conference on Reverse
Engineering, 2009.
[7] ——, “Code clone detection on specialized pdgs with heuristics,” in
European Conference on Software Maintenance and Reegineering, 2011.
[8] S. Horwitz, T. Reps, and D. Binkley, “Interprocedural slicing using
dependence graphs,” ACM Transactions on Programming Languages
and Systems, vol. 12, no. 1, pp. 26–60, Jan. 1990.
[9] M. Konopka and P. Navrat, “Untangling development tasks with soft-
ware developer’s activity,” in 2015 IEEE/ACM 2nd International Work-
shop on Context for Software Development, May 2015, pp. 13–14.
[10] H. A. Nguyen, A. T. Nguyen, T. N. Nguyen, Electrical, Computer, and
E. Department, “Filtering noise in mixed-purpose fixing commits to
improve defect prediction and localization,” in International Symposium
on Software Reliability Engineering (ISSRE), 2013.
[11] J. Silva, “A vocabulary of program slicing-based techniques,” ACM
Computing Surveys, vol. 44, no. 3, Jun. 2012.
[12] R. Stevens and C. De Roover, “Extracting executable transformations
from distilled code changes,” in International Conference on Software
Analysis, Evolution and Reengineering, 2017.
[13] Y. Tao, Y. Dang, T. Xie, D. Zhang, and S. Kim, “How do software
engineers understand code changes? - an exploratory study in industry,”
in International Symposium on the Foundations of Software Engineering
(FSE), 2012.
[14] Y. Tao and S. Kim, “Partitioning composite code changes to facilitate
code review,” in International Conference on Mining Software Reposi-
tories (MSR), 2015.
[15] M. Weiser, “Program slicing,” in International Conference on Software
Engineering (ICSE), 1981, pp. 439–449.
3
38