=Paper= {{Paper |id=Vol-2047/BENEVOL_2017_paper_10 |storemode=property |title=Untangling Source Code Changes Using Program Slicing |pdfUrl=https://ceur-ws.org/Vol-2047/BENEVOL_2017_paper_10.pdf |volume=Vol-2047 |authors=Ward Muylaert,Coen De Roover |dblpUrl=https://dblp.org/rec/conf/benevol/MuylaertR17 }} ==Untangling Source Code Changes Using Program Slicing== https://ceur-ws.org/Vol-2047/BENEVOL_2017_paper_10.pdf
                       Untangling Source Code Changes
                            Using Program Slicing
                               Ward Muylaert                                           Coen De Roover
                           ward.muylaert@vub.be                                     coen.de.roover@vub.be
                          Software Languages Lab                                   Software Languages Lab
                          Vrije Universiteit Brussel                               Vrije Universiteit Brussel
                              Brussels, Belgium                                       Brussels, Belgium



   Abstract—Version control systems (VCS) are widely used to            many different changes in a tangled commit. A code reviewer
manage the history of code bases. These histories in turn provide       will have a harder time understanding larger commits of
opportunities for research. Researchers expect the commits in           unrelated changes [13]. This in turn will lead to lower quality
these version control systems to be atomic. That is, each commit
performs one task. This is however not always the case. To              feedback [1]. A researcher interested in historical analysis,
remedy this, we propose a commit untangling technique using             finally, will need to decide on the “one” function of a commit
program slicing. In particular, we posit that all related changes       even though many unrelated changes may be present in the
are part of the same program slice. To do so, we perform program        commit.
slicing on changes. Preliminary results using intra-procedural             In light of these difficulties, we propose an automated
slicing have proven to be encouraging. We are currently working
on expanding our work to be inter-procedural.                           commit untangling technique. Our technique employs program
                                                                        slicing around changes. Program slicing is a technique to
                      I. I NTRODUCTION                                  answer questions about the influence of program statements
   Version control systems (VCS) are widely used to manage              on other program statements [15], [11]. We extend this idea.
the history of code bases. Prominent examples include Git,              We posit that all related changes are part of the same program
SVN, or Mercurial. A developer may “save” their changes                 slice. Thus, a commit may be untangled by means of the
into units called commits. Best practice suggests each commit           created slices. A preliminary implementation of our technique
should only contain changes related to one task. Such commits           performs intra-procedural slicing. Despite this limitation, early
are called atomic commits [2], [14]. In this manner, the version        results are encouraging. We are currently in the process of
control system can be used to keep track of how the program             expanding the implementation to work on an inter-procedural
under development evolves. The version control system can               level.
also be applied to, for example, revert individual changes
or port changes to other versions of the code base. On the                                   II. A RCHITECTURE
research side, version control systems provide a trove of                  Our technique consists of four major parts. First, the commit
software evolution information open to analysis.                        is distilled into fine-grained changes to the program’s abstract
   However, developers do not necessarily follow the best               syntax tree (AST). Second, a program dependence graph of the
practice of creating only atomic commits [9]. For example,              program is created. Third, a slice of the program dependence
a small bug may be quickly fixed while working on another               graph is produced for every fine-grained change. Finally,
feature and placed in the same commit. Floss refactoring is             changes are grouped by means of the slices and the commit is
another problem: refactoring in order to implement a new                partitioned accordingly. We have implemented our technique
feature. These situations make for larger commits in which              to work on Java programs. The rest of this section provides
many unrelated changes are tangled together. Such commits               further detail into each of the four steps.
are called tangled commits.                                                For the first step, we make use of ChangeNodes [12].
   Tangled commits occur on a regular basis. A study by                 ChangeNodes works on the AST of a Java program. Given
Herzig et al. found that up to 15% of Java bug fixes contain            two versions of a program, ChangeNodes provides a list of
tangled changes [3]. Tao et al. found that between 17% and              Insert, Update, Move, and Delete operations. In our case the
29% of investigated revisions were tangled [14]. Nguyen et              two versions are the version before and the version after the
al. found that 11% to 39% of all the fixing commits used for            commit under analysis. Applying the obtained list of changes
mining archives were tangled [10].                                      to the AST of the earlier version results in the AST of the
   Tangled commits lead to problems on many fronts. We pro-             later version. ChangeNodes thus provides fine-grained changes
vide four examples. A developer will have problems reverting            describing the commit.
particular changes if they are a part of a bigger commit.                  For the second step, we make use of TinyPDG [6], [7],
A developer may also have problems integrating particular               [5]. TinyPDG creates a program dependence graph (PDG) of
changes from another developer if the desired change is part of         a Java program. TinyPDG does this intra-procedurally. Our




                                                                    1

                                                                                                                                        36
main motivation for choosing TinyPDG is that it makes use of                  commits affecting just one method could be analysed. This
Eclipse libraries to create the underlying AST. ChangeNodes                   limits the number of commits that can be analysed, the large
also employs the Eclipse libraries for this purpose. This makes               majority of commits cover more than one method. In the case
it more straightforward to link these two parts together.                     of the Jaxen project, no commits are left to be analysed. This
   In the third step, our technique performs forward and                      limitation will be solved by our (current and) future work in
backward slicing on the program dependence graph. This is                     which we are expanding the implementation to work inter-
done for every distilled fine-grained change obtained in step                 procedurally. The setup of our evaluation will remain the same
one. Thus for every change ci a slice S(ci ) is obtained.                     for the inter-procedural evaluation.
We implemented the intra-procedural slicing algorithm as                         We apply our technique to every atomic and tangled commit
described by Horwitz et al. [8] on top of the PDG created by                  affecting just one method. We consider three separate out-
TinyPDG. The algorithm by Horwitz et al. performs backward                    comes for the analysis of a commit, regardless of it being
slicing. We adjusted the algorithm so that it also performs                   atomic or tangled.
forward slicing. When we refer to slicing, we consider the                       1) The commit is correctly identified as atomic or tangled.
combination of forward and backward slicing.                                         Our technique and the dataset agree on what kind of
   Finally, our technique considers the following relation R                         commit it is.
between changes. Changes ci and cj are related (notation:                        2) The commit is not correctly identified. Our technique
ci Rcj ) if and only if ci 2 S(cj ) _ cj 2 S(ci ). By definition                     identified the commit as atomic/tangled while the dataset
of how slicing works, this relation is reflexive. The relation                       calls the commit tangled/atomic respectively.
is also clearly symmetric due to its symmetric definition. The                   3) No result. This happened when the memory overhead
relation is however not necessarily transitive. We cannot state                      made the analysis crash. We have not yet played around
that if ci Rcj and cj Rck , then ci Rck . Consider for this the                      with this overhead in order to solve or analyse it.
following simplified situation. cj is part of the root node of a
                                                                                 In the case of atomic commits, the results are promising.
PDG with two children. ci is part of one of the child nodes. ck
                                                                              For each of the projects except JRuby, the ratio of correctly
is part of the other child node. Slicing in this situation results in
                                                                              analysed atomic commits is larger than 90%. For JRuby nearly
S(ci ) = {ci , cj }, S(cj ) = {ci , cj , ck }, and S(ck ) = {cj , ck }.
                                                                              20% of commits could not be analysed. Incorrectly identified
Then ci Rcj and cj Rck , but ¬(ci Rck ). The relation R is thus
                                                                              commits happened largely through statements like throw or
not an equivalence relation. Instead, we partition the set of
                                                                              catch not being supported by TinyPDG.
changes into subsets by means of the following steps.
                                                                                 Our technique falters on certain types of tangled com-
   1) If a change is not in relation with any change in any
                                                                              mits. Except for the GWT project, more tangled commits
       of the existing subsets, create a new subset with that
                                                                              were incorrectly identified as atomic than they were correctly
       change in it.
                                                                              identified as tangled. However, the large majority of these
   2) If a change is in a relation with (an) element(s) of exactly
                                                                              incorrect identifications are due to formatting changes or
       one existing subset, place the change in that subset.
                                                                              changes to comments. Our technique does not take formatting
   3) If a change is in a relation with two (or more) elements
                                                                              nor comments into account, only code. As such, a commit
       of different subsets, join the subsets together and add
                                                                              handling both one code related task and one formatting task
       the change to it.
                                                                              would be classified as tangled in the dataset, but identified as
   The last step fakes transitivity and effectively “widens” the              atomic by our technique.
relation R: more changes are considered related than they
would be by our original definition of R. In terms of these                                        IV. F UTURE W ORK
subsets, we rephrase our hypothesis as: A commit is atomic
if and only if our technique does not split up the commit into                   We are currently in the process of making the implementa-
different subsets of changes.                                                 tion of our technique inter-procedural. To do so, we need to
                                                                              make two major changes.
                        III. E VALUATION                                         First, our program dependence graph needs to consider the
   To evaluate our hypothesis, we make use of a dataset of                    entire program. Rather than a procedure dependence graph, we
five Java programs as used by Herzig et al. in [3], [4]. The                  require a system dependence graph (SDG) or class dependence
programs in question are: ArgoUML, GWT, Jaxen, JRuby,                         graph (ClDG). To achieve this, we need to extend TinyPDG to
and Xstream. These projects were chosen for meeting certain                   work inter-procedurally. The main hurdle here is creating the
quality criteria. For each of the projects, Herzig et al. manually            summary edges via the algorithm as described in [8]. These
identified atomic commits using commit and issue information.                 are necessary to avoid the calling context problem as described
Using the atomic commits, Herzig et al. also created artificial               there.
tangled commits for each of the projects.                                        Second, our slicer needs to be adjusted to work on the inter-
   We used this dataset to perform a preliminary evaluation of                procedural system dependence graph. These adjustments are
our hypothesis and technique. This preliminary evaluation has                 also described by [8].
one obvious limitation. Due to the current implementation of                     Once our implementation works inter-procedurally, we will
our technique being intra-procedural, only atomic and tangled                 perform the evaluation described in section III again.




                                                                          2

                                                                                                                                             37
                           V. C ONCLUSION
   We wanted to untangle commits performing more than
one task. For this, we considered the hypothesis that related
changes belong to one and the same program slice. We created
a first implementation to test this hypothesis. This implemen-
tation works intra-procedurally. Despite this limitation, initial
results are promising. Incorrect identifications happen largely
either due to parts of the program our implementation does
not handle (e.g., throw statements) or due to formatting and
comment changes which our technique does not consider. We
are currently working on expanding the implementation of our
technique to work inter-procedurally. Once this is done, we
will redo the evaluation with the complete dataset.
                        ACKNOWLEDGMENT
  Ward Muylaert is an SB PhD fellow at FWO, project
number 1S64317N.
                             R EFERENCES
 [1] A. Bacchelli and C. Bird, “Expectations, outcomes, and challenges
     of modern code review,” in International Conference on Software
     Engineering (ICSE), 2013.
 [2] M. Dias, A. Bacchelli, G. Gousios, D. Cassou, and S. Ducasse, “Un-
     tangling fine-grained code changes,” in International Conference on
     Software Analysis, Evolution, and Reengineering (SANER), 2015.
 [3] K. Herzig, S. Just, and A. Zeller, “The impact of tangled code changes
     on defect prediction models,” Empirical Software Engineering, vol. 21,
     no. 2, pp. 303–336, Apr. 2015.
 [4] K. Herzig and A. Zeller, “Untangling changes,” 2011.
 [5] Y. Higo. Tinypdg: A library for building intraprocedural pdgs for java
     programs. [Online]. Available: https://github.com/YoshikiHigo/TinyPDG
 [6] Y. Higo and S. Kusumoto, “Enhancing quality of code clone detection
     with program dependency graph,” in Working Conference on Reverse
     Engineering, 2009.
 [7] ——, “Code clone detection on specialized pdgs with heuristics,” in
     European Conference on Software Maintenance and Reegineering, 2011.
 [8] S. Horwitz, T. Reps, and D. Binkley, “Interprocedural slicing using
     dependence graphs,” ACM Transactions on Programming Languages
     and Systems, vol. 12, no. 1, pp. 26–60, Jan. 1990.
 [9] M. Konopka and P. Navrat, “Untangling development tasks with soft-
     ware developer’s activity,” in 2015 IEEE/ACM 2nd International Work-
     shop on Context for Software Development, May 2015, pp. 13–14.
[10] H. A. Nguyen, A. T. Nguyen, T. N. Nguyen, Electrical, Computer, and
     E. Department, “Filtering noise in mixed-purpose fixing commits to
     improve defect prediction and localization,” in International Symposium
     on Software Reliability Engineering (ISSRE), 2013.
[11] J. Silva, “A vocabulary of program slicing-based techniques,” ACM
     Computing Surveys, vol. 44, no. 3, Jun. 2012.
[12] R. Stevens and C. De Roover, “Extracting executable transformations
     from distilled code changes,” in International Conference on Software
     Analysis, Evolution and Reengineering, 2017.
[13] Y. Tao, Y. Dang, T. Xie, D. Zhang, and S. Kim, “How do software
     engineers understand code changes? - an exploratory study in industry,”
     in International Symposium on the Foundations of Software Engineering
     (FSE), 2012.
[14] Y. Tao and S. Kim, “Partitioning composite code changes to facilitate
     code review,” in International Conference on Mining Software Reposi-
     tories (MSR), 2015.
[15] M. Weiser, “Program slicing,” in International Conference on Software
     Engineering (ICSE), 1981, pp. 439–449.




                                                                               3

                                                                                   38